Cufflinks transcript assembly post-STAR

aadamk commented 6 years ago

I get the following error post-star alignment and during cufflinks transcript assembly:
... 196662 genes finished Traceback (most recent call last): File "/home/aadam/local/bin/bcbio_nextgen.py", line 234, in main(kwargs) File "/home/aadam/local/bin/bcbio_nextgen.py", line 43, in main run_main(kwargs) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 42, in run_main fc_dir, run_info_yaml) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 86, in _run_toplevel for xs in pipeline(config, run_info_yaml, parallel, dirs, samples): File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 243, in rnaseqpipeline samples = rnaseq.assemble_transcripts(run_parallel, samples) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/rnaseq.py", line 285, in assemble_transcripts samples = run_parallel("cufflinks_merge", [samples]) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel return run_multicore(fn, items, config, parallel=parallel) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items): File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in call while self.dispatch_one_batch(iterator): File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch self._dispatch(tasks) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch job = ImmediateComputeBatch(batch) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in init self.results = batch() File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in call return [func(*args, *kwargs) for func, args, kwargs in self.items] File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 50, in wrapper return apply(f, args, *kwargs) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 387, in cufflinks_merge return rnaseq.cufflinks_merge(args) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/rnaseq.py", line 246, in cufflinks_merge samples[0][0]) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/cufflinks.py", line 259, in merge filtered = annotate_gtf.cleanup_transcripts(classified, gtf_file, ref_file) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/annotate_gtf.py", line 52, in cleanup_transcripts assembled_db = gtf.get_gtf_db(assembled_gtf) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/gtf.py", line 43, in get_gtf_db return gffutils.FeatureDB(db_file) File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/gffutils/interface.py", line 131, in init version, dialect = c.fetchone() TypeError: 'NoneType' object is not iterable

Not sure why there is an empty object - after completing the processing of ~19000 loci, the following warning appears in the middle of my log file: [2017-11-06T16:04Z] multiprocessing: cufflinks_merge /home/aadam/bcbio/anaconda/lib/python2.7/site-packages/gffutils/create.py:85: UserWarning: 'infer_gene_extent' will be deprecated. For now, the following equivalent values were automatically set: 'disable_infer_genes=True', 'disable_infer_transcripts=True'. Please use these instead in the future. warnings.warn("'infer_gene_extent' will be deprecated. For now, "

After this, there are numerous messages stating 'x # of genes finished' after conversion of the .gtf to fasta format, then this error occurs. Not sure I understand it.

roryk commented 6 years ago

Thanks, sorry about that. That warning is innocuous and isn't the root cause of the problem. The steps that it is doing there is it is taking the cufflinks-assembly.gtf files for each sample, merging them together into a merged.gtf file and then cleaning up the final merged file into an assembly.gtf file. Are the cufflinks-assembly.gtf files floating around and do they look weird? What about the merged.gtf and assembly files?

aadamk commented 6 years ago

Don't have much experience with cufflinks, but it seems as if those files contain the pertinent information. Below is a snippet of the 'cleaned' assembly.gtf file for this sample:

1 Cufflinks transcript 569276 569608 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "2.1859046269"; frac "1.000000"; conf_lo "0.640237"; conf_hi "3.731573"; cov "5.921180"; 1 Cufflinks exon 569276 569608 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "2.1859046269"; frac "1.000000"; conf_lo "0.640237"; conf_hi "3.731573"; cov "5.921180"; 1 Cufflinks transcript 717239 717541 1000 . . gene_id "CUFF.7"; transcript_id "CUFF.7.1"; FPKM "8.3411101194"; frac "1.000000"; conf_lo "5.057095"; conf_hi "11.625125"; cov "22.319983"; 1 Cufflinks exon 717239 717541 1000 . . gene_id "CUFF.7"; transcript_id "CUFF.7.1"; exon_number "1"; FPKM "8.3411101194"; frac "1.000000"; conf_lo "5.057095"; conf_hi "11.625125"; cov "22.319983"; 1 Cufflinks transcript 762045 762421 1000 . . gene_id "CUFF.9"; transcript_id "CUFF.9.1"; FPKM "7.1802048868"; frac "1.000000"; conf_lo "4.636590"; conf_hi "9.723819"; cov "19.380856"; 1 Cufflinks exon 762045 762421 1000 . . gene_id "CUFF.9"; transcript_id "CUFF.9.1"; exon_number "1"; FPKM "7.1802048868"; frac "1.000000"; conf_lo "4.636590"; conf_hi "9.723819"; cov "19.380856"; 1 Cufflinks transcript 809360 809752 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "2.1067843741"; frac "1.000000"; conf_lo "0.746460"; conf_hi "3.467109"; cov "5.677438";

And a snippet of the merged.gtf generated in the assembly/cuffmerge directory (which does not seem to be a sample-specific directory):

1 protein_coding exon 860260 860328 . + . exon_number "1"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 861302 861393 . + . exon_number "2"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 865535 865716 . + . exon_number "3"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 866419 866469 . + . exon_number "4"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 871152 871276 . + . exon_number "5"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 874420 874509 . + . exon_number "6"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 874655 874671 . + . exon_number "7"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 860530 860569 . + . exon_number "1"; p_id "P5"; gene_id "ENSG00000187634"; tss_id "TSS57"; transcript_id "ENST00000437963"; class_code "="; gene_name "SAMD11"; 1 protein_coding exon 861302 861393 . + . exon_number "2"; p_id "P5"; gene_id "ENSG00000187634"; tss_id "TSS57"; transcript_id "ENST00000437963"; class_code "="; gene_name "SAMD11";

roryk commented 6 years ago

Thanks! Could you pass along the entire assembled.gtf file so I can see if I can figure out what is wrong?

aadamk commented 6 years ago

cufflinks-assembly-L1-4.zip Sure - I've attached all 4 .gtf's associated with this sample - the reason there are 4 is that the sample was distributed to 4 lanes of a Nextseq for paired-end sequencing, and I chose to include all 4 fastq pairs in a single .yaml file.
Thanks!

aadamk commented 6 years ago

I corrected my batch info under metadata within the .yaml file and that seemed to resolve the issue. Before I close this, one last question. I ran the STAR aligner on paired-end Nextseq data, so each sample has 4 sets, corresponding to 4 lanes, of paired fastqs. I ran the alignment separately for each lane to get separate qc metrics generated by STAR in the log.final.out file. This file has a lot of useful metrics such as % unmapped reads, % reads mapping to multiple loci, etc. Is there any way that I can merge the alignment data across the 4 lanes and obtain those same metrics? I attempted merging the bam files with samtools followed by Picardtools collectRNAseqmetrics and samtools 'stats', but neither quite give me the data that was generated by STAR (can't seem to find command or function in STAR manual either). Any insight you have would be great. Thanks!

roryk commented 6 years ago

Hi @aadamk,

For same-sample split across lanes, we usually merge the FASTQ files for each lane into one prior to running bcbio. We have a script to handle that: http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#multiple-files-per-sample

You can also just cat the FASTQ files together yourself, as long as they are not compressed or are gzipped. That is all that script is doing for the most part.

aadamk commented 6 years ago

Got it, thank you.

bcbio / bcbio-nextgen

Cufflinks transcript assembly post-STAR #2140