NUStatBioinfo / DegNorm

Normalizing RNA degradation in RNA-seq data
https://nustatbioinfo.github.io/DegNorm/
3 stars 1 forks source link

Error when loading reads #33

Open symbiologist opened 4 years ago

symbiologist commented 4 years ago

Hi,

I am attempting to run DegNorm on some human RNA-seq data that I aligned using hisat2. It is able to load the files and the GTF, but consistently gives me the error below after it starts loading the reads. Happy to provide a truncated GTF file and example bam if needed!

DegNorm (09/16/2019 07:04:15) ---- DegNorm output directory -- degnorm/degnorm_09162019_190415 DegNorm (09/16/2019 07:04:15) ---- SAMPLE s49 -- sample contains single-end reads DegNorm (09/16/2019 07:04:15) ---- SAMPLE s56 -- sample contains single-end reads DegNorm (09/16/2019 07:04:15) ---- SAMPLE s54 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- SAMPLE s82 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- SAMPLE s65 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- SAMPLE s73 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- SAMPLE s95 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- SAMPLE s33 -- sample contains single-end reads DegNorm (09/16/2019 07:04:16) ---- Begin genome annotation file processing... DegNorm (09/16/2019 07:04:16) ---- Loading genome annotation file /media/data1/dwu/tbi/plasma/data/plasma_integrated/degnorm/GRCh38.95.pasa_f2.gtf... DegNorm (09/16/2019 07:04:27) ---- Subsetting exon data to 194 chromosomes: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, GL000008.2, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000208.1, GL000213.1, GL000214.1, GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000224.1, GL000225.1, GL000226.1, KI270302.1, KI270303.1, KI270304.1, KI270305.1, KI270310.1, KI270311.1, KI270312.1, KI270315.1, KI270316.1, KI270317.1, KI270320.1, KI270322.1, KI270329.1, KI270330.1, KI270333.1, KI270334.1, KI270335.1, KI270336.1, KI270337.1, KI270338.1, KI270340.1, KI270362.1, KI270363.1, KI270364.1, KI270366.1, KI270371.1, KI270372.1, KI270373.1, KI270374.1, KI270375.1, KI270376.1, KI270378.1, KI270379.1, KI270381.1, KI270382.1, KI270383.1, KI270384.1, KI270385.1, KI270386.1, KI270387.1, KI270388.1, KI270389.1, KI270390.1, KI270391.1, KI270392.1, KI270393.1, KI270394.1, KI270395.1, KI270396.1, KI270411.1, KI270412.1, KI270414.1, KI270417.1, KI270418.1, KI270419.1, KI270420.1, KI270422.1, KI270423.1, KI270424.1, KI270425.1, KI270429.1, KI270435.1, KI270438.1, KI270442.1, KI270448.1, KI270465.1, KI270466.1, KI270467.1, KI270468.1, KI270507.1, KI270508.1, KI270509.1, KI270510.1, KI270511.1, KI270512.1, KI270515.1, KI270516.1, KI270517.1, KI270518.1, KI270519.1, KI270521.1, KI270522.1, KI270528.1, KI270529.1, KI270530.1, KI270538.1, KI270539.1, KI270544.1, KI270548.1, KI270579.1, KI270580.1, KI270581.1, KI270582.1, KI270583.1, KI270584.1, KI270587.1, KI270588.1, KI270589.1, KI270590.1, KI270591.1, KI270593.1, KI270706.1, KI270707.1, KI270708.1, KI270709.1, KI270710.1, KI270711.1, KI270712.1, KI270713.1, KI270714.1, KI270715.1, KI270716.1, KI270717.1, KI270718.1, KI270719.1, KI270720.1, KI270721.1, KI270722.1, KI270723.1, KI270724.1, KI270725.1, KI270726.1, KI270727.1, KI270728.1, KI270729.1, KI270730.1, KI270731.1, KI270732.1, KI270733.1, KI270734.1, KI270735.1, KI270736.1, KI270737.1, KI270738.1, KI270739.1, KI270740.1, KI270741.1, KI270742.1, KI270743.1, KI270744.1, KI270745.1, KI270746.1, KI270747.1, KI270748.1, KI270749.1, KI270750.1, KI270751.1, KI270752.1, KI270753.1, KI270754.1, KI270755.1, KI270756.1, KI270757.1, MT, X, Y DegNorm (09/16/2019 07:04:27) ---- Successfully loaded exon data -- shape: (608576, 4) DegNorm (09/16/2019 07:04:27) ---- Begin genome annotation file processing. DegNorm (09/16/2019 07:04:45) ---- Processing successful. Final shape -- (606976, 6) DegNorm (09/16/2019 07:04:45) ---- Found 25 chromosomes in intersection of all experiments and gene annotation data: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y DegNorm (09/16/2019 07:04:45) ---- Determining gene overlap structure across chromosomes. DegNorm (09/16/2019 07:04:50) ---- Rate of gene overlap: 39741 / 59551 DegNorm (09/16/2019 07:04:50) ---- Loading RNA-seq data file 1 / 8 DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49 -- sample contains single-end reads DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49: begin computing coverage, read counts for 25 chromosomes... DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 1 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 10 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 11 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 12 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 13 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 14 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 15 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 16 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 17 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 18 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 19 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 2 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 20 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 21 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:50) ---- SAMPLE s49, CHR 22 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 3 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 4 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 5 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 6 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 7 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 8 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR 9 -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR MT -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR X -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:04:51) ---- SAMPLE s49, CHR Y -- begin loading reads from hisat2_subset/s49.bam DegNorm (09/16/2019 07:05:01) ---- SAMPLE s49, CHR 18 -- reads successfully loaded. shape = (32648, 3) DegNorm (09/16/2019 07:05:04) ---- SAMPLE s49, CHR 22 -- reads successfully loaded. shape = (40800, 3) DegNorm (09/16/2019 07:05:05) ---- SAMPLE s49, CHR 20 -- reads successfully loaded. shape = (51233, 3) DegNorm (09/16/2019 07:05:08) ---- SAMPLE s49, CHR 10 -- reads successfully loaded. shape = (72760, 3) DegNorm (09/16/2019 07:05:11) ---- SAMPLE s49, CHR 4 -- reads successfully loaded. shape = (87460, 3) DegNorm (09/16/2019 07:05:13) ---- SAMPLE s49, CHR 19 -- reads successfully loaded. shape = (81565, 3) DegNorm (09/16/2019 07:05:13) ---- SAMPLE s49, CHR Y -- reads successfully loaded. shape = (4349, 3) DegNorm (09/16/2019 07:05:14) ---- SAMPLE s49, CHR 15 -- reads successfully loaded. shape = (68310, 3) DegNorm (09/16/2019 07:05:15) ---- SAMPLE s49, CHR 17 -- reads successfully loaded. shape = (88072, 3) DegNorm (09/16/2019 07:05:16) ---- SAMPLE s49, CHR 7 -- reads successfully loaded. shape = (93724, 3) DegNorm (09/16/2019 07:05:16) ---- SAMPLE s49, CHR 13 -- reads successfully loaded. shape = (39463, 3) DegNorm (09/16/2019 07:05:17) ---- SAMPLE s49, CHR X -- reads successfully loaded. shape = (74915, 3) DegNorm (09/16/2019 07:05:18) ---- SAMPLE s49, CHR 11 -- reads successfully loaded. shape = (108328, 3) DegNorm (09/16/2019 07:05:18) ---- SAMPLE s49, CHR 3 -- reads successfully loaded. shape = (93726, 3) DegNorm (09/16/2019 07:05:21) ---- SAMPLE s49, CHR 6 -- reads successfully loaded. shape = (116257, 3) DegNorm (09/16/2019 07:05:23) ---- SAMPLE s49, CHR 5 -- reads successfully loaded. shape = (87289, 3) DegNorm (09/16/2019 07:05:23) ---- SAMPLE s49, CHR 9 -- reads successfully loaded. shape = (67594, 3) DegNorm (09/16/2019 07:05:24) ---- SAMPLE s49, CHR 2 -- reads successfully loaded. shape = (151278, 3) DegNorm (09/16/2019 07:05:24) ---- SAMPLE s49, CHR 16 -- reads successfully loaded. shape = (63918, 3) DegNorm (09/16/2019 07:05:26) ---- SAMPLE s49, CHR 12 -- reads successfully loaded. shape = (115429, 3) DegNorm (09/16/2019 07:05:29) ---- SAMPLE s49, CHR 8 -- reads successfully loaded. shape = (77380, 3) DegNorm (09/16/2019 07:05:38) ---- SAMPLE s49, CHR 14 -- reads successfully loaded. shape = (154661, 3) DegNorm (09/16/2019 07:05:41) ---- SAMPLE s49, CHR 1 -- reads successfully loaded. shape = (170919, 3) Traceback (most recent call last): File "/home/dwu/miniconda3/envs/degnorm/bin/degnorm", line 11, in load_entry_point('DegNorm==0.1.4', 'console_scripts', 'degnorm')() File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/DegNorm-0.1.4-py3.6.egg/degnorm/main.py", line 137, in main , exon_df=exon_df) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/DegNorm-0.1.4-py3.6.egg/degnorm/reads.py", line 847, in coverage_read_counts for chrom in self.chroms) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/joblib/parallel.py", line 994, in call self.retrieve() File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/joblib/parallel.py", line 897, in retrieve self._output.extend(job.get(timeout=self.timeout)) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/multiprocessing/pool.py", line 644, in get raise self._value File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, *kwds)) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 561, in call return self.func(args, **kwargs) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/joblib/parallel.py", line 261, in call for func, args, kwargs in self.items] File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/joblib/parallel.py", line 261, in for func, args, kwargs in self.items] File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/DegNorm-0.1.4-py3.6.egg/degnorm/reads.py", line 405, in chromosome_coverage_read_counts lambda x: sum([int(k) for k, v in re.findall(r'(\d+)([A-Z]?)', x)])) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/pandas/core/series.py", line 3591, in apply mapped = lib.map_infer(values, f, convert=convert_dtype) File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/site-packages/DegNorm-0.1.4-py3.6.egg/degnorm/reads.py", line 405, in lambda x: sum([int(k) for k, v in re.findall(r'(\d+)([A-Z]?)', x)])) File "/home/dwu/miniconda3/envs/degnorm/lib/python3.6/re.py", line 222, in findall return _compile(pattern, flags).findall(string) TypeError: expected string or bytes-like object

Thank you!

David

ffineis-aptitive commented 4 years ago

Hey David,

Thanks for using DegNorm, and thanks for posting an informative stacktrace. Looks like it's a string formatting issue related to the cigar strings. Please post or link to an example .bam file.

Frank

symbiologist commented 4 years ago

Hi Frank,

Of course! here is the bam of first sample, S49, where the error occurred: https://www.dropbox.com/s/30klsrhxsj8b69h/s49.bam?dl=0 And here is the .bai: https://www.dropbox.com/s/qj8q2c4uql5dmnq/s49.bai?dl=0

It's about 3.2 GB bam file. Also I just realized while looking at log that these should be paired-end samples, but DegNorm recognized them as single-end - do you think this is related to the error?

ffineis commented 4 years ago

Hey David,

Excellent, thanks for sending. Unfortunately I won't have time to debug until the weekend, which is when Northwestern's HPC comes back online anyway. But yes, your highlighting of the single/paired-end read thing it's likely a similar formatting issue - paired reads are determined by whether or not the read qnames are suffixed with a ".2" or a ".1" (single-end). I think upon cracking open your .bam I'll be able to diagnose.

symbiologist commented 4 years ago

Hi Frank,

Great, thank you. Also if it is helpful to know, I generated these bam files using the following parameters on hisat2:

--dta (downstream transcriptome assembly, requires longer anchor lengths for de novo splice sites) --rna-strandness RF (reverse stranded paired-end reads)

Other than these I did not apply custom parameters.

Best,

David

ffineis commented 4 years ago

Hey David,

Dug into this - it's a hisat thing. Our coverage calculation code requires read qname to end in .1 or .2 in order to identify pairs of reads and group them together. If such a pattern is not detected, reads are assumed to be single-end. Your reads look like this:

Screen Shot 2019-09-19 at 8 33 49 PM

As you can see, the read query names do not match this pattern. It is not clear to me how one is supposed to identify two reads that form a pair based on this data. If you determine the logic, I might be open to implementing it.

Moreover, as you can see, some of the CIGAR scores appear to be missing. Degnorm uses pysam to load .bam files, for reference. I'm not totally sure how to fix this with hisat2. I would recommend more research into hisat2 or using tophat (we developed degnorm using tophat alignments).

Here is a sample .sam file we use for degnorm testing - you'll see the qname pattern I'm referring to.

hg_small_1_sample.sam.zip

symbiologist commented 4 years ago

Hi Frank,

Thanks for taking a look at this! I'm not sure why some CIGAR scores are missing - I'll look more into this, but at least it appears to be a minority of reads. Should each pair have the same CIGAR? If so, perhaps the missing ones are mate 2.

Regarding read1 and read2 - I opened up my bam file in IGV and it looks like the reads of each pair has identical names. Here is an example:

Screenshot 2019-09-23 11 36 14

I also took a look at the HISAT2 manual and found this. It looks like the flags can identify mate 1 and mate 2 of the pair:

Screenshot 2019-09-23 13 33 48

From https://ccb.jhu.edu/software/hisat2/manual.shtml#sam-output

Would this be straightforward to implement? Since HISAT2 is supposed to replace tophat (even the tophat website states it is superseded by HISAT2 which is "more accurate and much more efficient"), I imagine many other users will be coming from HISAT2 in the future as well. I understand this may take some time to incorporate!