MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

_csv.Error: field larger than field limit (131072) #144

Closed kaspartom closed 9 months ago

kaspartom commented 9 months ago

Hello, when analyzing cluster properties of my sRNA data, I encountered this error:

_csv.Error: field larger than field limit (131072)

I have found this solution to work, when added to your code:

csv.field_size_limit(sys.maxsize)

Is the error somewhere on my side? Running it on Ubuntu 20.04 LTS WSL.
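For readers applying the same workaround in their own scripts, here is a minimal sketch of the call above with one extra guard. The helper name `raise_csv_field_limit` is hypothetical and not part of ShortStack. The guard is there because `csv.field_size_limit` raises `OverflowError` when the value does not fit in a C long, which can happen with `sys.maxsize` on some platforms (not on the 64-bit Linux setup described here).

```python
import csv
import sys


def raise_csv_field_limit():
    """Raise the csv module's per-field limit as high as the platform accepts.

    Hypothetical helper for illustration; returns the limit that was set.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value did not fit in a C long on this platform; try a smaller one.
            limit //= 2


new_limit = raise_csv_field_limit()
print("csv field size limit is now", new_limit)
```

The limit is a module-wide setting, so it has to be raised before any `csv.reader` is used on the data.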

MikeAxtell commented 9 months ago

Thanks for the bug report. Another user also reported this (#138), but it was never resolved because the user did not post their data for me to test; see that thread. Basically, I think it means that there is a malformed or highly unusual BAM/SAM file. The 131072-byte limit for a field in a csv file is the default limit of the Python module that ShortStack uses to parse SAM output. Can you post your input data and exact command so I can test? You can send the input datasets offline (gdrive or something).

Yes, your hack sounds like it will work fine, but the question I have is: why are there such huge fields in a small RNA-seq BAM/SAM file to begin with?
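The limit described above can be seen in isolation with a short sketch. It uses only Python's standard `csv` module on made-up tab-delimited text, not ShortStack's actual parsing code. The value 131072 is taken from the error message in this thread, and `parse_tab_line` is a hypothetical helper.

```python
import csv
import io

LIMIT_FROM_ERROR = 131072  # value reported in the error message in this thread


def parse_tab_line(line):
    """Parse one tab-delimited line with csv.reader and return its fields."""
    return next(csv.reader(io.StringIO(line), delimiter="\t"))


# Pin the limit to the reported value for the demo; the call returns the old limit.
previous_limit = csv.field_size_limit(LIMIT_FROM_ERROR)

# An ordinary-sized field parses without complaint.
small_fields = parse_tab_line("read1\t" + "A" * 1000)

# A single field well over the limit triggers the error from the report.
try:
    parse_tab_line("read1\t" + "A" * (2 * LIMIT_FROM_ERROR))
    oversize_error = None
except csv.Error as exc:
    oversize_error = str(exc)

csv.field_size_limit(previous_limit)  # restore whatever was set before
print(len(small_fields), oversize_error)
```

The check applies to one field at a time, so the error only appears once the parser believes a single field has grown past the limit.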

kaspartom commented 9 months ago

Hi, thanks for the response. I tried to replicate it with a smaller subsample of my read files, but this time it finished successfully. So I'm sending you the whole read files. The command that replicates the error for me is this:

ShortStack --genomefile Arabidopsis_thalianaTAIR10.fa --readfile ./sample.fq.gz --known_miRNAs mature_Arabidopsis_miRNA.fa --nohp --dicermin 20 --dicermax 25

The analysis ends at this step with the csv field size limit error:

Sat 25 Nov 2023 10:53:58 +0100 CET
Analyzing cluster properties using 1 threads

The files are here on gdrive: https://drive.google.com/drive/folders/1NcMqw7UflGPWluDq0ZyRjV3y_KK3sBn3?usp=sharing. I'm also adding a BAM file that replicates the error.

MikeAxtell commented 9 months ago

Thanks! I have retrieved your data from gdrive and I am testing now.

MikeAxtell commented 9 months ago

Thanks again. Your data were fine. There is some strange behavior with csv.reader that I do not understand: it should be reading data line by line, and no single line or field exceeds the byte limit. I tried a bunch of things and failed to fix it. So, yes, the simplest fix is as you suggested: add csv.field_size_limit(sys.maxsize) to the start of the script. This was added in commit 2b8c4c5 and will be included in the next release.

Thank you again for pointing this out and sharing your test data!
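The cause was not identified in this thread. As a hedged illustration only, the sketch below shows one known way Python's `csv.reader` can report an oversized field even when every input line is short. With default settings, a field that starts with a double quote is treated as a quoted field, and the reader keeps appending following lines until it finds a closing quote. Whether this is what happened with the data above is an assumption, not something confirmed here. The SAM-like records and the helper names are made up for the demo.

```python
import csv


def sam_like_lines(n_reads):
    """Yield made-up tab-delimited records; only the first has a field starting with a quote."""
    yield "read0\t0\tChr1\t1\t" + '"' + "I" * 20 + "\n"
    for i in range(1, n_reads):
        yield f"read{i}\t0\tChr1\t{i + 1}\t" + "I" * 21 + "\n"


def count_rows(quoting):
    """Count the rows csv.reader produces under the given quoting mode."""
    reader = csv.reader(sam_like_lines(10000), delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader)


previous_limit = csv.field_size_limit(131072)  # limit value from the error in this thread

# Default quoting: the opening quote swallows later lines into one growing field.
try:
    count_rows(csv.QUOTE_MINIMAL)
    default_error = None
except csv.Error as exc:
    default_error = str(exc)

# QUOTE_NONE: quote characters are ordinary data, so each line is its own row.
rows_without_quoting = count_rows(csv.QUOTE_NONE)

csv.field_size_limit(previous_limit)  # restore whatever was set before
print(default_error, rows_without_quoting)
```

If this were the cause, passing `quoting=csv.QUOTE_NONE` to the reader would address it at the source, while raising the field limit only hides the symptom and can still merge rows. That is a design note under the stated assumption, not a diagnosis of ShortStack.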

kaspartom commented 9 months ago

Glad I could help!