MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 30 forks source link

shortstack4.0.3 ValueError: invalid literal for int() with base 10: '' #151

Closed joweihsieh closed 2 months ago

joweihsieh commented 6 months ago

Hi Dr. Axtell,

Thanks for developing this useful software!! I've been attempting to utilize shortstack4.0.3 for both known and de novo miRNA prediction for wheat sRNA data. However, I encountered a ValueError: invalid literal for int() with base 10: ''.

Btw, I am able to run miRNA prediction for the Arabidopsis toy example on shortstack4.0.3, and my wheat sRNA datasets performed well on shortstack3.8.5.

Could you please help me in resolving this issue?

Thank you, Jo-Wei

======= Below are my output Beginning run Options: { 'adapter': None, 'align_only': False, 'autotrim': False, 'autotrim_key': 'TCGGACCAGGCTTCATTCCCC', 'bamfile': [ '/home/jhsieh/processed_results/202403_TWG/shortstackv4.0.3/20240424_chrABD/goodDist/merged.bam'], 'dicermax': 24, 'dicermin': 20, 'dn_mirna': True, 'genomefile': '/home/jhsieh/genome/Triticum_aestivum/modifyfromlab/IWGSC2/v1/GenomicSEQ/Original/fasta/iwgsc_refseqv2.1_assembly.fa', 'known_miRNAs': 'tae_known_miRNAs.fasta', 'locifile': None, 'locus': None, 'mincov': 1.0, 'mmap': 'u', 'nohp': False, 'outdir': '/home/jhsieh/processed_results/202403_TWG/shortstackv4.0.3/20240424_chrABD/goodDist_miR3', 'pad': 75, 'readfile': None, 'show_secondaries': False, 'strand_cutoff': 0.8, 'threads': 20} Required executable RNAfold : /home/jhsieh/miniconda3/envs/ShortStack4/bin/RNAfold Required executable strucVis : /home/jhsieh/miniconda3/envs/ShortStack4/bin/strucVis Required executable bowtie : /usr/local/bowtie-1.3.1-linux-x86_64/bowtie Required executable bowtie-build : /usr/local/bowtie-1.3.1-linux-x86_64/bowtie-build Required executable samtools : /usr/local/samtools-1.11/bin/samtools

Mon 06 May 2024 12:44:21 -0500 CDT Defining small RNA clusters de novo With 411496348 total reads and mincov of 1.0 reads per million, the min read depth is 411

Mon 06 May 2024 13:23:03 -0500 CDT Analyzing cluster properties using 20 threads

reads processed: 125

reads with at least one alignment: 119 (95.20%)

reads that failed to align: 6 (4.80%)

Reported 43603 alignments [bam_sort_core] merging from 0 files and 20 in-memory blocks...

Mon 06 May 2024 16:56:59 -0500 CDT Completed

Mon 06 May 2024 16:56:59 -0500 CDT Searching for valid microRNA loci Aligning known_miRNAs sequences to genome

Screening of possible microRNAs from user provided known_miRNAs multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2359, in mir_analysis q_exact_count = count_exact_location(bedfields, merged_bam) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2684, in count_exact_location read_count = int(mb.stdout.rstrip()) ^^^^^^^^^^^^^^^^^^^^^^^ ValueError: invalid literal for int() with base 10: '' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 3619, in mir_qdata = mirna(args, merged_bam, fai, pmir_bedfile, read_count) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2187, in mirna user_mloci1 = pool.starmap(mir_analysis, ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 774, in get raise self._value ValueError: invalid literal for int() with base 10: ''

MikeAxtell commented 6 months ago

I have not encountered this error before nor has it been reported (that I can recall). It looks a failure of a simple sub.process call to samtools, which is used to count aligned reads from a specific location. I can't reproduce the error here without your exact original data. Can you send me your data? You'll need to post them on a cloud location somewhere because of file sizes. I need all input files :

If you can get those data to me I can try and reproduce the error and go from there.

MikeAxtell commented 6 months ago

Try commit 35d16a108f8e70e228e0423e7a9e22545208bc90 and see if your error goes away. I think I might have guessed at the issue, but I don't have any datasets here that are reproducing your error so I can't be sure.

joweihsieh commented 6 months ago

Dear Dr. Axtell,

Thank you for your prompt response and assistance! I will also try commit 35d16a1 later today to see if the errors go away!

Here is the Dropbox link with all the required input files: https://www.dropbox.com/scl/fo/g4pzi3txipxt5tg7eor3d/ACVt6NH9jtV3p1gtIbP5XOM?rlkey=nosuq7o1v6mnnkuljedhsuwuw&dl=0

I have sent an invitation for folder access to your email (mja18@psu.edu). Please let me know if you encounter any issues accessing the folder, and also once you have finished downloading the files.

Thank you, Jo-Wei

MikeAxtell commented 6 months ago

Thanks for posting the data. Unfortunately, your merged.bam file is very large , 189Gb. This is proving difficult to download using the Dropbox web interface. I've tried three times and had failed downloads or corrupted data each time.

I did not see any obvious issue with the genome file or the known_miRNAs file.

Can you instead test the https://github.com/MikeAxtell/ShortStack/commit/35d16a108f8e70e228e0423e7a9e22545208bc90 commit on your end with your data and report if the issue was solved? That commit or any commit later than that and I think the issue may have been solved.

joweihsieh commented 6 months ago

Hi Dr. Axtell,

Thanks for checking the miRNA fasta and genome fasta file for me.

Following your suggestions, I've revised the script accordingly. I am testing it now, but it may take some time due to the large size of the bam file.

I will get back to you once I have the results.

Thanks again, Jo-Wei

joweihsieh commented 6 months ago

Hi Dr. Axtell,

It still has the same errors.

========= Screening of possible microRNAs from user provided known_miRNAs multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2359, in mir_analysis q_exact_count = count_exact_location(bedfields, merged_bam, fai) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2690, in count_exact_location read_count = int(mb.stdout.rstrip()) ^^^^^^^^^^^^^^^^^^^^^^^ ValueError: invalid literal for int() with base 10: '' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 3625, in mir_qdata = mirna(args, merged_bam, fai, pmir_bedfile, read_count) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/bin/ShortStack", line 2187, in mirna user_mloci1 = pool.starmap(mir_analysis, ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4/lib/python3.12/multiprocessing/pool.py", line 774, in get raise self._value ValueError: invalid literal for int() with base 10: ''

========= I further have these tests for the original script:

It seems that the ValueError is caused by my bam file.

Jo-Wei

MikeAxtell commented 6 months ago

Yes, I'm working on this. I believe I have identified the issue but need to complete a few more tests ...

MikeAxtell commented 6 months ago

This issue is that parts of ShortStack's microRNA annotation process are not properly dealing with secondary alignments. Your original large BAM file has a lot of secondary alignments (bit 256 set in the FLAG column). If I filter your BAM file to remove all secondary alignments, I get a much smaller file that only contains primary alignments:

samtools view -b -F 256 merged.bam > nosec.bam

When I use that file with ShortStack there were no bugs:

ShortStack --genomefile iwgsc_refseqv2.1_assembly.fa --bamfile nosec.bam --known_miRNAs tae_known_miRNAs.fasta --outdir RelRun1NoSec --threads 10 --dn_mirna

ShortStack's default alignment method does not output secondary alignments, so it seems you made your sRNA-seq alignments with a different program. Nonetheless, ShortStack should be able to handle secondary alignments in a BAM file. I am working through fixing that bug now.

In the meantime, you can use samtools view -b -F 256 merged.bam > nosec.bam and then use the nosec.bam file as input to ShortStack.

Thanks again for reporting the bug.

MikeAxtell commented 6 months ago

OK, commit a70cd65d0b47cc861c48ce2c6eb9bcccee771be5 fixes this issue. Bamfiles with secondary alignments will now work.

This will be included in the next release, version 4.0.4, coming very soon.

joweihsieh commented 6 months ago

Hi Dr. Axtell,

Wow, thank you, that's awesome! I'll try it right away. Thanks for your help!

Using reads without secondary alignments doesn't seem ideal to me, especially considering the presence of multiple subgenomes in my species. This approach would result in a lot of missing information; signals from some subgenomes might not be successfully counted, leading to the absence of miRNA predictions from those subgenomes.

Anyway, many thanks for fixing this!!! :)

Jo-Wei

MikeAxtell commented 6 months ago

To be clear,

ShortStack will always ignore secondary alignments. Each read has a single primary alignment and that is the only one that is analyzed.

When there are multiple reads with identical sequences and that sequence has more than one valid alignment position, depending on the alignment method used, some reads will have a primary at position A, some at position B, and so on. So you don't need secondary alignments to account for multi-mapping reads.

From: joweihsieh @.> Date: Tuesday, May 14, 2024 at 1:55 PM To: MikeAxtell/ShortStack @.> Cc: Axtell, Michael @.>, Assign @.> Subject: Re: [MikeAxtell/ShortStack] shortstack4.0.3 ValueError: invalid literal for int() with base 10: '' (Issue #151)

Hi Dr. Axtell,

Wow, thank you, that's awesome! I'll try it right away. Thanks for your help!

Using reads without secondary alignments doesn't seem ideal to me, especially considering the presence of multiple subgenomes in my species. This approach would result in a lot of missing information; signals from some subgenomes might not be successfully counted, leading to the absence of miRNA predictions from those subgenomes.

Anyway, many thanks for fixing this!!! :)

Jo-Wei

— Reply to this email directly, view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/151#issuecomment-2110805211, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABUJPCJBJJ474IHVDRC7CQDZCJFXRAVCNFSM6AAAAABHJ44K56VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJQHAYDKMRRGE. You are receiving this because you were assigned.Message ID: @.***>

joweihsieh commented 6 months ago

Hi Dr. Axtell,

Thanks for the clarification and the release of shortstack4.0.4. I am able to run my data using shortstack4.0.4.

Many thanks, Jo-Wei

joweihsieh commented 4 months ago

Hi Dr. Axtell,

Even using ShortStack 4.0.4, I encountered this issue again when trying to analyze BAM files from another species.

Could you please help me again? Thanks, Jo-Wei

======== ShortStack version 4.0.4

Beginning run Options: { 'adapter': None, 'align_only': False, 'autotrim': False, 'autotrim_key': 'TCGGACCAGGCTTCATTCCCC', 'bamfile': [ '10_S2.bow.bam', '11_S7.bow.bam', '12_S8.bow.bam', '173_S11.bow.bam', '186_S2.bow.bam', '187_S3.bow.bam', '215_S18.bow.bam', '216_S19.bow.bam', '217_S20.bow.bam', '220_S21.bow.bam', '221_S22.bow.bam', '222_S23.bow.bam', '226_S24.bow.bam', '227_S25.bow.bam', '228_S26.bow.bam', '229_S27.bow.bam', '232_S27.bow.bam', '235_S18.bow.bam', '8_S1.bow.bam', 'L07_12_12_S37.bow.bam', 'L07_12_13_S38.bow.bam', 'L07_33_11_S42.bow.bam', 'L07_33_12_S43.bow.bam', 'L11_12_11_S45.bow.bam', 'L11_21_12_S5.bow.bam', 'L11_21_13_S6.bow.bam', 'L11_33_11_S7.bow.bam', 'L11_33_12_S8.bow.bam'], 'dicermax': 24, 'dicermin': 20, 'dn_mirna': True, 'genomefile': '/home/jhsieh/genome/Lophopyrum_elongatum/process/v2/TWG_SDAU1_genome.fa', 'known_miRNAs': 'miRBasePlantMirnas.fa', 'locifile': None, 'locus': None, 'mincov': 1.0, 'mmap': 'u', 'no_bigwigs': False, 'nohp': False, 'outdir': './goodDist_miR', 'pad': 75, 'readfile': None, 'show_secondaries': False, 'strand_cutoff': 0.8, 'threads': 10} Required executable RNAfold : /home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/RNAfold Required executable strucVis : /home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/strucVis Required executable bowtie : /usr/local/bowtie-1.3.1-linux-x86_64/bowtie Required executable bowtie-build : /usr/local/bowtie-1.3.1-linux-x86_64/bowtie-build Required executable samtools : /usr/local/samtools-1.11/bin/samtools

Fri 21 Jun 2024 00:06:19 -0500 CDT Required bowtie indices not found. Building them ...

Completed

Fri 21 Jun 2024 01:35:04 -0500 CDT Defining small RNA clusters de novo With 351641827 total reads and mincov of 1.0 reads per million, the min read depth is 352

Fri 21 Jun 2024 02:03:40 -0500 CDT Analyzing cluster properties using 10 threads

reads processed: 10414

reads with at least one alignment: 2602 (24.99%)

reads that failed to align: 7812 (75.01%)

Reported 23516 alignments [bam_sort_core] merging from 0 files and 10 in-memory blocks...

Fri 21 Jun 2024 03:22:31 -0500 CDT Completed

Fri 21 Jun 2024 03:22:31 -0500 CDT Searching for valid microRNA loci Aligning known_miRNAs sequences to genome

Screening of possible microRNAs from user provided known_miRNAs multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/lib/python3.12/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/lib/python3.12/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/ShortStack", line 2371, in mir_analysis q_exact_count = count_exact_location(bedfields, merged_bam, fai) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/ShortStack", line 2707, in count_exact_location read_count = int(mb.stdout.rstrip()) ^^^^^^^^^^^^^^^^^^^^^^^ ValueError: invalid literal for int() with base 10: '' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/ShortStack", line 3654, in mir_qdata = mirna(args, merged_bam, fai, pmir_bedfile, read_count) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/bin/ShortStack", line 2199, in mirna user_mloci1 = pool.starmap(mir_analysis, ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/lib/python3.12/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jhsieh/miniconda3/envs/ShortStack4.0.4/lib/python3.12/multiprocessing/pool.py", line 774, in get raise self._value ValueError: invalid literal for int() with base 10: ''

MikeAxtell commented 4 months ago

Thanks for getting in touch again Jo-Wei. I'm not sure what is going on here. My suggestion is to run the analysis directly from the raw, untrimmed FASTQ files, and to let ShortStack handle the adapter trimming and the read-alignments. My guess, only a guess here, is that your BAM files, created by another process, have formatting that is causing an issue. I've seen issues where users have had trimmed, QC sRNA-seq data in the FASTA format with headers that look like this: >@read1. This will break samtools, upon which ShortStack and many other tools depend.

Anyway try analyzing your data not from the BAM files that you've created using different software, but instead use ShortStack to do the trimming of the raw FASTQ and the subsequent alignments.

MikeAxtell commented 4 months ago

Clarification: >@read1 formatting in FASTA breaks samtools and thus ShortStack because the read names begin with the @ symbol, which is a reserved symbol in the SAM/BAM format (it denotes header lines). There is some company / software which is making this types of small RNA FASTA files, but which are formatted in such a way that samtools cannot parse them.

joweihsieh commented 4 months ago

Hi Dr. Axtell,

Thanks for the quick response and great suggestions!! I will work on it, as suggested.

Best, Jo-Wei

joweihsieh commented 4 months ago

Hi Dr. Axtell,

Just to let you know, I was able to run the script successfully using the trimmed fastq.

Best, Jo-Wei

joweihsieh commented 3 months ago

Hi Dr. Axtell,

Sorry to bother you again. After reviewing the results, I noticed that there is no information recorded in the mir.fasta file, and in the Results.txt, the column for known miRNAs shows "NA" for every row. Do you have any idea on what might be causing this?

Thanks, Jo-Wei

=======================

MikeAxtell commented 2 months ago

Sorry for the very slow response on this.

Your log file indicates two bedtools errors that occurred during the run :

Sat 20 Jul 2024 08:18:40 -0500 CDT
Analyzing cluster properties using 10 threads
ERROR: Received illegal bin number 37460 from getBin call.
ERROR: Unable to add record to tree.
ERROR: Received illegal bin number 37460 from getBin call.
ERROR: Unable to add record to tree.

I've not seen this before and suspect something local to your file system. I'm sorry but I do not know what this is, but I don't suspect a bug in the code with the data I have at present.

AlexSaraz1 commented 2 months ago

Hi Mike, Hi Jo-Wei,

I had a similar error and was able to track it down. Here is Jo-Wei issue:

Required executable samtools : /usr/local/samtools-1.11/bin/samtools

ShortStack make use of the "-e STR" option from samtools. This option is not present in samtools 1.11 (nor was it in the version 1.10 that I was using). It has been included in version 1.12. Without this option, the call to "samtools view -c -e" will result in an empty stdout which leads to "ValueError: invalid literal for int() with base 10: '' when int(X_count.stdout.rstrip()) is applied.

ShortStack README indicates samtools >= 1.16 as requirement (so yes, I should have RTFM).

I think Mike can close the issue.

Have a nice day. Best, Alex

MikeAxtell commented 2 months ago

Thanks Alex, appreciate that. Yes, current ShortStack requires samtools >= 1.16 as specified in manual. I recommend people use conda / bioconda to manage installation for these reasons. The bioconda recipe https://github.com/bioconda/bioconda-recipes/blob/master/recipes/shortstack/meta.yaml for ShortStack includes the version requirements of dependencies, and as such conda will handle it gracefully.

I should have thought of this before, old samtools installations are very common!