aadamk closed 6 years ago
Thanks, sorry about that. That warning is innocuous and isn't the root cause of the problem. At that step it takes the cufflinks-assembly.gtf file for each sample, merges them into a merged.gtf file, and then cleans up the merged file into an assembly.gtf file. Are the cufflinks-assembly.gtf files floating around, and do they look weird? What about the merged.gtf and assembly.gtf files?
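One rough way to check whether a GTF file "looks weird" is to count its feature types and flag lines that do not have the nine tab-separated columns the GTF format expects. This is only a sketch: `summarize_gtf` is a hypothetical helper written for this thread, not part of bcbio or Cufflinks, and it does not validate the attribute column.

```python
def summarize_gtf(path):
    """Count feature types and collect line numbers of malformed records.

    Hypothetical helper: a record is considered malformed here only if it
    does not split into exactly 9 tab-separated fields.
    """
    counts = {}
    malformed = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            # Skip blank lines and comment/header lines.
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) != 9:
                malformed.append(lineno)
                continue
            feature = fields[2]  # third column: transcript, exon, ...
            counts[feature] = counts.get(feature, 0) + 1
    return counts, malformed
```

Running it on each cufflinks-assembly.gtf, on merged.gtf, and on the cleaned assembly.gtf, then comparing the counts, would show whether one of the files is empty or truncated.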
Don't have much experience with cufflinks, but it seems as if those files contain the pertinent information. Below is a snippet of the 'cleaned' assembly.gtf file for this sample:
1 Cufflinks transcript 569276 569608 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "2.1859046269"; frac "1.000000"; conf_lo "0.640237"; conf_hi "3.731573"; cov "5.921180";
1 Cufflinks exon 569276 569608 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "2.1859046269"; frac "1.000000"; conf_lo "0.640237"; conf_hi "3.731573"; cov "5.921180";
1 Cufflinks transcript 717239 717541 1000 . . gene_id "CUFF.7"; transcript_id "CUFF.7.1"; FPKM "8.3411101194"; frac "1.000000"; conf_lo "5.057095"; conf_hi "11.625125"; cov "22.319983";
1 Cufflinks exon 717239 717541 1000 . . gene_id "CUFF.7"; transcript_id "CUFF.7.1"; exon_number "1"; FPKM "8.3411101194"; frac "1.000000"; conf_lo "5.057095"; conf_hi "11.625125"; cov "22.319983";
1 Cufflinks transcript 762045 762421 1000 . . gene_id "CUFF.9"; transcript_id "CUFF.9.1"; FPKM "7.1802048868"; frac "1.000000"; conf_lo "4.636590"; conf_hi "9.723819"; cov "19.380856";
1 Cufflinks exon 762045 762421 1000 . . gene_id "CUFF.9"; transcript_id "CUFF.9.1"; exon_number "1"; FPKM "7.1802048868"; frac "1.000000"; conf_lo "4.636590"; conf_hi "9.723819"; cov "19.380856";
1 Cufflinks transcript 809360 809752 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "2.1067843741"; frac "1.000000"; conf_lo "0.746460"; conf_hi "3.467109"; cov "5.677438";
And a snippet of the merged.gtf generated in the assembly/cuffmerge directory (which does not seem to be a sample-specific directory):
1 protein_coding exon 860260 860328 . + . exon_number "1"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 861302 861393 . + . exon_number "2"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 865535 865716 . + . exon_number "3"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 866419 866469 . + . exon_number "4"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 871152 871276 . + . exon_number "5"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 874420 874509 . + . exon_number "6"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 874655 874671 . + . exon_number "7"; p_id "P4"; gene_id "ENSG00000187634"; tss_id "TSS56"; transcript_id "ENST00000420190"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 860530 860569 . + . exon_number "1"; p_id "P5"; gene_id "ENSG00000187634"; tss_id "TSS57"; transcript_id "ENST00000437963"; class_code "="; gene_name "SAMD11";
1 protein_coding exon 861302 861393 . + . exon_number "2"; p_id "P5"; gene_id "ENSG00000187634"; tss_id "TSS57"; transcript_id "ENST00000437963"; class_code "="; gene_name "SAMD11";
Thanks! Could you pass along the entire assembled.gtf file so I can see if I can figure out what is wrong?
cufflinks-assembly-L1-4.zip
Sure - I've attached all 4 .gtf's associated with this sample - the reason there are 4 is that the sample was distributed to 4 lanes of a Nextseq for paired-end sequencing, and I chose to include all 4 fastq pairs in a single .yaml file.
Thanks!
I corrected my batch info under metadata within the .yaml file and that seemed to resolve the issue. Before I close this, one last question. I ran the STAR aligner on paired-end NextSeq data, so each sample has 4 sets of paired fastqs, one per lane. I ran the alignment separately for each lane to get separate QC metrics generated by STAR in the Log.final.out file. This file has a lot of useful metrics such as % unmapped reads, % reads mapping to multiple loci, etc. Is there any way that I can merge the alignment data across the 4 lanes and obtain those same metrics? I tried merging the bam files with samtools and then running Picard CollectRnaSeqMetrics and samtools 'stats', but neither quite gives me the data that STAR generated (I can't find a command or function for this in the STAR manual either). Any insight you have would be great. Thanks!
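One possible workaround, sketched here under assumptions and not something STAR provides, is to sum the read counts from each lane's Log.final.out and recompute the percentages from the pooled totals. The helper names are hypothetical, and the count labels in `COUNT_KEYS` are assumed to match the labels in your own Log.final.out; they may differ between STAR versions, so check them against a real file first.

```python
# Assumed labels of count lines in STAR's Log.final.out; adjust to your file.
COUNT_KEYS = [
    "Number of input reads",
    "Uniquely mapped reads number",
    "Number of reads mapped to multiple loci",
    "Number of reads mapped to too many loci",
]


def parse_star_log(text):
    """Parse 'label | value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def pool_lanes(log_texts, count_keys=COUNT_KEYS):
    """Sum counts over lanes and express them as % of pooled input reads."""
    totals = dict.fromkeys(count_keys, 0)
    for text in log_texts:
        stats = parse_star_log(text)
        for key in count_keys:
            totals[key] += int(stats[key])
    n_input = totals["Number of input reads"]
    percents = {
        key: 100.0 * value / n_input
        for key, value in totals.items()
        if key != "Number of input reads"
    }
    return totals, percents
```

Here `log_texts` would be the contents of the four per-lane Log.final.out files read as strings. Only count-based metrics can be pooled this way; averages such as mapped length would need weighting by read count.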
Hi @aadamk,
For a single sample split across lanes, we usually merge the FASTQ files from each lane into one before running bcbio. We have a script to handle that: http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#multiple-files-per-sample
You can also just cat the FASTQ files together yourself, as long as they are either all uncompressed or all gzipped. That is all that script is doing for the most part.
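As a minimal sketch of the "just cat them" approach in Python: `concat_fastqs` is a hypothetical helper, not the bcbio script, and it assumes the inputs are either all plain text or all gzipped, since concatenated gzip files still form a valid gzip stream.

```python
import shutil


def concat_fastqs(lane_files, out_file):
    """Concatenate FASTQ files byte-for-byte, in the order given."""
    gzipped = [path.endswith(".gz") for path in lane_files]
    # Mixing compressed and uncompressed inputs would corrupt the output.
    if any(gzipped) and not all(gzipped):
        raise ValueError("inputs must be all gzipped or all uncompressed")
    with open(out_file, "wb") as out:
        for path in lane_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_file
```

For paired-end data, call it once for the R1 files and once for the R2 files with the lanes in the same order, so that read pairs stay in sync.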
Got it, thank you.
I get the following error post-STAR alignment and during cufflinks transcript assembly:
... 196662 genes finished
Traceback (most recent call last):
  File "/home/aadam/local/bin/bcbio_nextgen.py", line 234, in
    main(**kwargs)
  File "/home/aadam/local/bin/bcbio_nextgen.py", line 43, in main
    run_main(**kwargs)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 42, in run_main
    fc_dir, run_info_yaml)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 86, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 243, in rnaseqpipeline
    samples = rnaseq.assemble_transcripts(run_parallel, samples)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/rnaseq.py", line 285, in assemble_transcripts
    samples = run_parallel("cufflinks_merge", [samples])
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
    self.results = batch()
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 50, in wrapper
    return apply(f, *args, **kwargs)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 387, in cufflinks_merge
    return rnaseq.cufflinks_merge(*args)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/rnaseq.py", line 246, in cufflinks_merge
    samples[0][0])
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/cufflinks.py", line 259, in merge
    filtered = annotate_gtf.cleanup_transcripts(classified, gtf_file, ref_file)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/annotate_gtf.py", line 52, in cleanup_transcripts
    assembled_db = gtf.get_gtf_db(assembled_gtf)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/bcbio/rnaseq/gtf.py", line 43, in get_gtf_db
    return gffutils.FeatureDB(db_file)
  File "/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/gffutils/interface.py", line 131, in __init__
    version, dialect = c.fetchone()
TypeError: 'NoneType' object is not iterable
Not sure why there is an empty object - after completing the processing of ~19000 loci, the following warning appears in the middle of my log file:
[2017-11-06T16:04Z] multiprocessing: cufflinks_merge
/home/aadam/bcbio/anaconda/lib/python2.7/site-packages/gffutils/create.py:85: UserWarning: 'infer_gene_extent' will be deprecated. For now, the following equivalent values were automatically set: 'disable_infer_genes=True', 'disable_infer_transcripts=True'. Please use these instead in the future.
warnings.warn("'infer_gene_extent' will be deprecated. For now, "
After this, there are numerous messages stating 'x # of genes finished' after conversion of the .gtf to fasta format, then this error occurs. Not sure I understand it.
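For anyone reading the traceback later: the final frame unpacks `c.fetchone()`, and a DB-API cursor returns `None` from `fetchone()` when the query matched no rows, which is what produces `TypeError: 'NoneType' object is not iterable`. The sketch below checks a database file for that condition with plain `sqlite3`. It assumes, without confirmation from this thread, that the gffutils database is an SQLite file with a table named `meta` holding `version` and `dialect` columns; `read_meta` is a hypothetical helper.

```python
import sqlite3


def read_meta(db_file):
    """Return the (version, dialect) row, or None if it cannot be read.

    Assumption: the database has a 'meta' table with 'version' and
    'dialect' columns. Note that sqlite3.connect creates the file if it
    does not exist yet.
    """
    conn = sqlite3.connect(db_file)
    try:
        row = conn.execute("SELECT version, dialect FROM meta").fetchone()
    except sqlite3.OperationalError:
        # Raised when the table itself is missing.
        row = None
    finally:
        conn.close()
    return row
```

If this returns `None` for the database built from the assembled GTF, the database was probably left empty or half-written, and deleting it so that it is rebuilt on the next run may be worth trying.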