bsmn / bsmn-pipeline

BSMN common data processing pipeline

test existing pipeline implementation #19

Open kdaily opened 5 years ago

kdaily commented 5 years ago

I'm going to open this issue to track the work related to the first milestone, which is getting the existing pipeline working for other people besides @bintriz!

I'm going to assign it to @bintriz and @attilagk as they are the ones working the most on it at this point.

Please do open specific issues related to bugs, feature requests, etc. that arise.

attilagk commented 5 years ago

Hi @bintriz

I am having a hard time understanding your sample_list.txt format. I am looking at your code in genome_mapping/run.py but still have the following questions:

  1. Why do you have BAMs in the file_name field in your sample_list.txt example? For realignment? Isn't that the input file name? Wouldn't a more typical input be FASTQ?
  2. At what location and with what name will the output BAM be created?
bintriz commented 5 years ago

The pipeline accepts both kinds of files, BAM and FASTQ, as input. If the input file type is BAM, it converts the BAM into FASTQ files and then proceeds. Three kinds of location are possible. At first, it only accepted a Synapse ID as the location, since it used the Synapse client as the interface to download input files. Later, I added an interface to download files directly from the NDA using the AWS client, and also added support for local input files.
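
For example, a sample_list.txt mixing the three location types might look like this (the IDs and paths are made up, and the NDA location is assumed to be an S3 URL):

#sample_id      file_name       location
sample1 sample1_R1.fastq.gz     syn12345678
sample1 sample1_R2.fastq.gz     syn12345679
sample2 sample2.bam     s3://nda-bucket/sample2.bam
sample3 sample3_R1.fastq.gz     /data/sample3_R1.fastq.gz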

The location where output files are uploaded is controlled by the parentid option. If that option is turned on, output files are uploaded to the Synapse folder specified by its Synapse ID and then deleted locally. Without this option, output files stay local. If you use AWS, the occupied space will be charged; that's the reason I added this option as the final step.
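
A minimal sketch of what that final step might look like, assuming the synapseclient Python API (the function and its file handling are my illustration, not the pipeline's actual code):

import os
import synapseclient

def upload_and_clean(path, parentid):
    # Upload one output file to the Synapse folder given by parentid,
    # then delete the local copy so the occupied AWS space is not billed.
    syn = synapseclient.login()
    syn.store(synapseclient.File(path, parent=parentid))
    os.remove(path)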

If you could rephrase the README file from your point of view, that would be great!

attilagk commented 5 years ago

@bintriz I ran genome_mapping.sh but the run failed.

The command was

[ec2-user@ip-172-31-4-155 common-sample]$ /shared/bsmn_pipeline/genome_mapping.sh sample_list.txt

where sample_list.txt looks like this:

#sample_id      file_name       location
sample1 CSNeuNP_S7_L001_R1-small.fq.gz  syn17931354

This failed with the following errors:

- Check synapse login
Welcome, Attila Gulyás-Kovács!

- Check NDA login
Requesting token succeeded!

sample1
Traceback (most recent call last):
  File "/shared/bsmn_pipeline/genome_mapping/run.py", line 102, in <module>
    main()
  File "/shared/bsmn_pipeline/genome_mapping/run.py", line 40, in main
    jid_list.append(submit_pre_jobs_fastq(sample, fname, loc))
  File "/shared/bsmn_pipeline/genome_mapping/run.py", line 54, in submit_pre_jobs_fastq
    job_home=job_home, sample=sample, fname=fname, loc=loc))
  File "/shared/bsmn_pipeline/library/job_queue.py", line 87, in submit
    jid = m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

The run_info is:

#PATH
PIPE_HOME=/shared/bsmn_pipeline

#TOOLS
PYTHON3=/shared/bsmn_pipeline/tools/python/3.6.2/bin/python3
SYNAPSE=/shared/bsmn_pipeline/tools/python/3.6.2/bin/synapse
AWS=/shared/bsmn_pipeline/tools/python/3.6.2/bin/aws
JAVA=/shared/bsmn_pipeline/tools/java/jdk1.8.0_191/bin/java
BWA=/shared/bsmn_pipeline/tools/bwa/0.7.16a/bin/bwa
SAMTOOLS=/shared/bsmn_pipeline/tools/samtools/1.7/bin/samtools
SAMBAMBA=/shared/bsmn_pipeline/tools/sambamba/v0.6.7/bin/sambamba
GATK=/shared/bsmn_pipeline/tools/gatk/3.7-0/GenomeAnalysisTK.jar
PICARD=/shared/bsmn_pipeline/tools/picard/2.12.1/picard.jar
BGZIP=/shared/bsmn_pipeline/tools/htslib/1.7/bin/bgzip
TABIX=/shared/bsmn_pipeline/tools/htslib/1.7/bin/tabix
VT=/shared/bsmn_pipeline/tools/vt/2018-06-07/bin/vt
BCFTOOLS=/shared/bsmn_pipeline/tools/bcftools/1.7/bin/bcftools
ROOTSYS=/shared/bsmn_pipeline/tools/root/6.14.00
CNVNATOR=/shared/bsmn_pipeline/tools/cnvnator/2018-07-09/bin/cnvnator

#RESOURCES
REFDIR=/shared/bsmn_pipeline/resources
REF=/shared/bsmn_pipeline/resources/hs37d5.fa
DBSNP=/shared/bsmn_pipeline/resources/dbsnp_138.b37.vcf
MILLS=/shared/bsmn_pipeline/resources/Mills_and_1000G_gold_standard.indels.b37.vcf
INDEL1KG=/shared/bsmn_pipeline/resources/1000G_phase1.indels.b37.vcf
OMNI=/shared/bsmn_pipeline/resources/1000G_omni2.5.b37.vcf
HAPMAP=/shared/bsmn_pipeline/resources/hapmap_3.3.b37.vcf
SNP1KG=/shared/bsmn_pipeline/resources/1000G_phase1.snps.high_confidence.b37.vcf
KNOWN_GERM_SNP=/shared/bsmn_pipeline/resources/gnomAD.1KG.ExAC.ESP6500.Kaviar.snps.txt.gz
MASK1KG=/shared/bsmn_pipeline/resources/20141020.strict_mask.whole_genome.fasta.gz

#SYNAPSE
PARENTID=None
bintriz commented 5 years ago

The output file name is determined by the sample name in the sample_list file. If you'd like to use FASTQ files as input, you should put the R1 and R2 files together. This pipeline groups all inputs by sample name. So, if you have multiple input files due to separate library preparations or multiple sequencing runs, just list all input files together and use the same sample name. Then the pipeline groups all files and makes one merged, reprocessed BAM file.
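
For example, one sample sequenced in two runs would be listed like this (file names and Synapse IDs are hypothetical):

#sample_id      file_name       location
sample1 sample1_L001_R1.fq.gz   syn11111111
sample1 sample1_L001_R2.fq.gz   syn11111112
sample1 sample1_L002_R1.fq.gz   syn11111113
sample1 sample1_L002_R2.fq.gz   syn11111114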

kdaily commented 5 years ago

Thanks for the explanations @bintriz! Make sure this information makes it into the readme.

This is slightly out of scope, but allowing multiple input formats and moving that logic into the code is not optimal. There should be separate steps for each task - if you start with a BAM you are performing two tasks (conversion and mapping), and they should be separated out.

bintriz commented 5 years ago

If somebody can make it optimal, that would be good. I just made it work.

attilagk commented 5 years ago

@bintriz as you suggested, I put the R1 and R2 files together in the following two ways but received the same error as before. See details below. The input files are in this Synapse folder (syn17931318).

The first way I tried was using local FASTQ files:

#sample_id      file_name       location
sample1 CSNeuNP_S7_L001_R1-small.fq.gz  /home/ec2-user/aln-test/common-sample/CSNeuNP_S7_L001_R1-small.fq.gz
sample1 CSNeuNP_S7_L001_R2-small.fq.gz  /home/ec2-user/aln-test/common-sample/CSNeuNP_S7_L001_R2-small.fq.gz

The second way was with Synapse location:

#sample_id      file_name       location
sample0 CSNeuNP_S7_L001_R1-small.fq.gz  syn17931354
sample0 CSNeuNP_S7_L001_R2-small.fq.gz  syn17932616
bintriz commented 5 years ago

In my environment, your sample file works. It looks like the error means that the Python library handling qsub somehow doesn't parse the job ID correctly in your pcluster. I'll look into it.
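
For reference, a minimal sketch of the failure mode, assuming library/job_queue.py captures the job ID from qsub's stdout with a regex (the command and pattern here are guesses, not the actual code):

import re
import subprocess

def submit(qsub_cmd):
    # qsub normally prints: Your job 123 ("name") has been submitted
    out = subprocess.run(qsub_cmd, shell=True, stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    m = re.search(r"Your job (\d+)", out)
    if m is None:
        # qsub printed an error instead, e.g. about an unknown parallel
        # environment, so there is no job ID to capture.
        raise RuntimeError("could not parse a job ID from qsub output: " + out)
    return m.group(1)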

attilagk commented 5 years ago

@bintriz would it help you if I gave you access to our AWS EC2 instance?

bintriz commented 5 years ago

That’s a good idea. I’ll send my public key in a separate email. Please add it to ~/.ssh/authorized_keys. Then I should be able to log in to your AWS cluster.


bintriz commented 5 years ago

Hi Attila,

I found the reason. This error is due to your SGE system not having a parallel environment named “threaded” set up, which this pipeline's job scripts rely on. The README already mentions it: https://github.com/bsmn/bsmn-pipeline#extra-set-up-for-sge. Please set this up and try again. It should work.


attilagk commented 5 years ago

@bintriz , I didn't know that our AWS EC2 cluster was also an SGE system, so I ignored that part of the documentation since it seemed irrelevant for AWS EC2.

attilagk commented 5 years ago

@bintriz , in any case, I ran the code in the documentation but got the following error message. Can you advise on what might have gone wrong? Thanks.

[ec2-user@ip-172-31-4-155 ~]$ sudo su
[root@ip-172-31-4-155 ec2-user]# qconf -Ap << END
> pe_name            threaded
> slots              99999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
> END
error: no option argument provided to "-Ap"
SGE 8.1.9
usage: qconf [options]
   [-aattr obj_nm attr_nm val obj_id_list]  add to a list attribute of an object
...
bintriz commented 5 years ago

I realized that a heredoc doesn't work with qconf -Ap: the option expects a file name as its argument and doesn't read from stdin, which is why you got the "no option argument" error. So I separated it into two steps by creating a temp file. Try the commands below.

$ sudo su
# cat <<END >/tmp/tmp.conf
pe_name            threaded
slots              99999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    \$pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE
END
# qconf -Ap /tmp/tmp.conf && rm /tmp/tmp.conf
# qconf -mattr queue pe_list threaded all.q
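
If qconf accepted the configuration, qconf -sp threaded should print it back, and qconf -sq all.q should now list threaded in its pe_list.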
attilagk commented 5 years ago

Hi @bintriz, I ran the qconf setup now with success; at least I didn't get any error messages.

I submitted the same pair of small FASTQ files for mapping as before; see my previous note in this thread. As before, I called your genome_mapping.sh twice: first with the FASTQ files on Synapse and second with the same FASTQs stored locally on our AWS EC2 instance.

This time I didn't get any errors from running genome_mapping.sh. But how can I check that the mapping script is really running?

attilagk commented 5 years ago

Hi @kdaily, sorry if this is not the best place to report this kind of issue, but I got a SynapseFileCacheError when I attempted to store a small FASTQ file in the BSMN scratch space on Synapse. This happened with both the command line and the Python client (see code blocks below). Can you suggest how to troubleshoot this? Thanks.

With the command line synapse client:

attila@ada:/projects/bsm/reads/2016-05-02-MS-common-sample$ synapse store CSNeuNP_S7_L001_R1_001.fastq.gz --parentId syn18233615

##################################################                         
 Uploading file to your external S3 storage
##################################################

Uploading [####################]100.00%   9.8MB/9.8MB (9.1MB/s) CSNeuNP_S7_L001_R1_001.fastq.gz Done...

SynapseFileCacheError: Could not obtain a lock on the file cache within timeout: 0:01:10  Please try again later

With the python synapseclient:

     84             raise SynapseFileCacheError("Could not obtain a lock on the file cache within timeout: %s  "
---> 85                                         "Please try again later" % str(timeout))
bintriz commented 5 years ago

genome_mapping.sh is just a submitter of jobs to SGE. SGE's qstat command will give you the current job status. By the way, if you run the same sample twice with the same sample ID, that will be a problem. My job scripts rely on the sample ID to create the working directory, so two sets of jobs with the same sample ID would compete with each other and try to overwrite files with the same names.

attilagk commented 5 years ago

Thanks for the explanation @bintriz. Hopefully the two jobs won't mess each other up, because they differ in their input: for the first job it's local files and for the second it's files on Synapse. The fact that the files in the two locations are copies of each other shouldn't matter, should it?

bintriz commented 5 years ago

Once the files are downloaded, all of the names of intermediate and result files are determined from the sample ID, which is used as a prefix. So if you use the same sample ID, it will be a problem.

attilagk commented 5 years ago

I see. I've just checked the two jobs with qstat. Both are in the Eqw state:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     14 0.55500 pre_3.subm ec2-user     Eqw   02/11/2019 15:34:30                                    1
     19 0.55500 pre_3.subm ec2-user     Eqw   02/11/2019 15:37:29                                    1

The start of both jobs had exit status 100, as assessed by qstat -j [job id]:

error reason          1:      02/11/2019 18:03:26 [498:6058]: exit_status of job start = 100

I deleted both jobs and will now try to resubmit only one of them :)

attilagk commented 5 years ago

Hi @bintriz, I've been checking the status of the mapping I submitted a week ago. pre_1.down and pre_2.spli have completed, but pre_3.subm has not.

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     24 0.55500 pre_3.subm ec2-user     Eqw   02/12/2019 10:07:34                                    1

Something went wrong in pre_3.subm.

[ec2-user@ip-172-31-4-155 ~]$ qstat -j 24 | grep error
error reason          1:      02/12/2019 13:28:41 [498:5642]: exit_status of job start = 100

The logfile pre_3.submit_aln_jobs.sh.o14 contains the following traceback:

[ec2-user@ip-172-31-4-155 logs]$ cat pre_3.submit_aln_jobs.sh.o14
---
[Mon Feb 11 18:03:26 CST 2019] Start submit_aln_jobs.
Traceback (most recent call last):
  File "/shared/bsmn_pipeline/genome_mapping/submit_aln_jobs.py", line 60, in <module>
    main()
  File "/shared/bsmn_pipeline/genome_mapping/submit_aln_jobs.py", line 27, in main
    "{job_home}/aln_2.merge_bam.sh {sample}".format(job_home=job_home, sample=args.sample))
  File "/shared/bsmn_pipeline/library/job_queue.py", line 87, in submit
    jid = m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
attilagk commented 5 years ago

The previous error was raised when I tried to map FASTQ files stored on the AWS EC2 instance. I deleted the pending job from the queue and started a new run, this time with the same FASTQ files stored in the BSMN scratch space on Synapse.

attilagk commented 5 years ago

Hi @bintriz, two days ago I submitted the mapping jobs for the small test FASTQs that are stored on Synapse (the BSMN Scratch Space). The jobs are still in the waiting queue. What do you think, is a two-day wait in the queue normal?

[ec2-user@ip-172-31-4-155 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     25 0.55500 pre_1.down ec2-user     qw    02/19/2019 09:18:42                                    1
     27 0.55500 pre_1.down ec2-user     qw    02/19/2019 09:18:42                                    1
     26 0.00000 pre_2.spli ec2-user     hqw   02/19/2019 09:18:42                                    3
     28 0.00000 pre_2.spli ec2-user     hqw   02/19/2019 09:18:42                                    3
     29 0.00000 pre_3.subm ec2-user     hqw   02/19/2019 09:18:42                                    1
attilagk commented 5 years ago

Hi @bintriz! The latest run also failed, but this time the error occurred in pre_2.spli. In this case the input FASTQs were stored in the BSMN Scratch Space.

Apart from the error, the jobs were in the qw (queued and waiting) state for almost a week.

Below are the details.

[ec2-user@ip-172-31-4-155 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     26 0.55500 pre_2.spli ec2-user     Eqw   02/19/2019 09:18:42                                    3
     28 0.55500 pre_2.spli ec2-user     Eqw   02/19/2019 09:18:42                                    3
     29 0.00000 pre_3.subm ec2-user     hqw   02/19/2019 09:18:42                                    1
[ec2-user@ip-172-31-4-155 ~]$ qstat -j 26 | grep error
error reason          1:      02/22/2019 18:18:11 [498:5472]: exit_status of job start = 100
kdaily commented 5 years ago

@attilagk I can't think of any reason other than misconfiguration that your jobs would not run. All nodes that you request should be brought up immediately, unless you are using spot pricing for EC2 instances. The default for CFN (and ParallelCluster) is to use on-demand nodes, though.

kdaily commented 5 years ago

Exit code 100 is usually reserved for a 'general error', which is defined by the running application. In SGE, a job script that exits with status 100 is also put into the error state, which matches the Eqw states you saw. It looks like @bintriz is doing something with that here:

https://github.com/bsmn/bsmn-pipeline/blob/master/genome_mapping/job_scripts/pre_1.download.sh#L5