MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

Problem with de novo miRNA annotation with SS 4.0.2 #138

Closed by FlaviaPavan 9 months ago

FlaviaPavan commented 1 year ago

Hi Dr Axtell,

I am predicting de novo miRNAs with SS 4.0.2, and the analysis fails for one of my BAM files (SS works perfectly for the others). I previously predicted miRNAs with SS 3.8.5 from the same BAM without any problems. Here is the error that appeared in the log:

Analyzing cluster properties using 2 threads
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/bin/ShortStack", line 1778, in quant
    for row in reader:
_csv.Error: field larger than field limit (131072)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/bin/ShortStack", line 3588, in <module>
    qdata, pmir_bedfile = quant_controller(args, merged_bam, cluster_bed, read_count)
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/bin/ShortStack", line 1996, in quant_controller
    q_results = pool.starmap(quant, q_iter)
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/eep/softwares/miniconda/envs/shortstack-4.0.2/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
_csv.Error: field larger than field limit (131072)

Can you tell me what the problem is? Thanks, Flavia Pavan

MikeAxtell commented 1 year ago

The error message _csv.Error: field larger than field limit (131072) indicates that you have some unusual lines in your BAM file. Specifically, it appears there are one or more lines where there are more than 131072 characters in one of the tab-delimited SAM fields. I can't see a reason why any field in a valid small RNA-seq BAM file would have over one hundred thousand characters. Did you make this BAM file with ShortStack's aligner? Are there very long reads in it? Are there header lines in the BAM that are extremely long?

The 131072 character limit is a default for Python's csv parser, which ShortStack uses to quickly parse SAM data.
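As a minimal illustration of that limit (a sketch only, not ShortStack's own code), the snippet below reads the current limit with `csv.field_size_limit()` and reproduces the same `_csv.Error` on a synthetic tab-delimited line. The helper name `parse_tab_delimited` and the fake record are made up for this example.

```python
import csv
import io


def parse_tab_delimited(text):
    """Parse tab-delimited text with Python's csv reader (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [row for row in reader]


# Calling field_size_limit() with no argument returns the current limit
# without changing it; on a fresh interpreter this is the default.
default_limit = csv.field_size_limit()

# A short synthetic record parses without trouble.
short_rows = parse_tab_delimited("read1\t0\tchr1\tACGT\n")

# A synthetic record whose last field is clearly longer than the limit.
long_field = "A" * (default_limit + 10)
try:
    parse_tab_delimited("read1\t0\tchr1\t" + long_field + "\n")
    error_message = None
except csv.Error as exc:
    error_message = str(exc)

print(default_limit)
print(error_message)
```

The limit applies per field, not per line, so a single oversized column is enough to trigger the error.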

FlaviaPavan commented 1 year ago

Thanks for your quick reply. My data are small RNA-seq; I use Bowtie for mapping and give SS a sorted BAM file. I have re-run the mapping and checked that the new BAM was correct, and I still have the same problem when predicting miRNAs. I have used the new BAM file in other scripts, which work fine.
I would like to avoid mapping with SS because I want to keep the same pipeline for all my samples. Do you think I could solve the problem another way?

MikeAxtell commented 1 year ago

I suspect the BAM is corrupt in some way. It appears to have an unusually large field, with more than 100,000 characters, in one or more lines. If you want to post it somewhere I can get it (use my regular email, not GitHub), I can take a look.
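One way to check for such a field before sharing the file is to scan the alignments as SAM text and report the longest tab-delimited field. The sketch below is only an assumption about how one might do that; `longest_field` and the sample records are hypothetical and not part of ShortStack. It splits on tabs directly, so it is not subject to the csv field limit.

```python
def longest_field(sam_lines):
    """Return (length, line_number, column_number) of the longest tab-delimited field.

    Header lines (starting with '@') are scanned too, since an extremely
    long header line would also produce an oversized field.
    """
    best = (0, 0, 0)
    for line_number, line in enumerate(sam_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        for column_number, field in enumerate(fields, start=1):
            if len(field) > best[0]:
                best = (len(field), line_number, column_number)
    return best


# Tiny synthetic input: one header line and one made-up alignment record.
seq = "TGACAGAAGAGAGTGAGCACA"
qual = "I" * len(seq)
sample = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "read1\t0\tchr1\t100\t255\t21M\t*\t0\t0\t" + seq + "\t" + qual + "\n",
]

length, line_number, column_number = longest_field(sample)
print(length, line_number, column_number)
```

On a real file, the same function could be given `sys.stdin` with the output of `samtools view -h` piped in. A result far above typical small RNA read lengths would point to the offending line and column.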

MikeAxtell commented 9 months ago

Another user also reported this error (#144). I could not trace exactly why it happens (the csv fields are never really that large), but I did make a simple hack to fix it, as of commit 2b8c4c5. This will be included in the next release. Thanks again for the bug report.
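The thread does not say what the change in commit 2b8c4c5 is, and the cause was not confirmed. For readers hitting the same error in their own scripts, here is one general way Python's csv reader can report an oversized field even when no real field is long. This is offered as a possibility, not as the diagnosis for this issue. With the default dialect, a field that begins with `"` is treated as quoted, so the reader keeps consuming tabs and newlines until a closing quote appears. The records below are synthetic.

```python
import csv
import io

# Two made-up tab-delimited records; the second field of the first
# record happens to begin with a double-quote character.
sam_text = 'r1\t"II\nr2\tII\n'

# Default dialect: the leading '"' opens a quoted field, so the rest of the
# input (tabs and newlines included) is absorbed into that one field.
default_rows = list(csv.reader(io.StringIO(sam_text), delimiter="\t"))

# quoting=csv.QUOTE_NONE: '"' is treated as an ordinary character.
safe_rows = list(
    csv.reader(io.StringIO(sam_text), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(safe_rows))
print(safe_rows)
```

In a large input, a field absorbed this way can grow past the limit. Passing `quoting=csv.QUOTE_NONE` avoids that behavior. Raising the limit with `csv.field_size_limit(new_limit)` is another commonly used workaround, though it would not stop records from being merged.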