cihga39871 / Atria

An accurate and ultra-fast adapter and quality trimming program for Illumina Next-Generation Sequencing (NGS) data.

Too many arguments error #15

Closed EorgeKit closed 7 months ago

EorgeKit commented 7 months ago

Hello @cihga39871, I have been trying to use Atria for the first time, but every time I try to run it I get the "too many arguments" error, as follows. Running script:

#!/usr/bin/env bash

#Data locations
workDir="/media/geokit/Extreme SSD/Eugene"
echo 'work directory is' $workDir
ls "$workDir"
##Data1
Read1_1='/media/geokit/Extreme SSD/Eugene/S1_S1_L001_R1_001.fastq.gz'
Read1_2='/media/geokit/Extreme SSD/Eugene/S1_S1_L001_R2_001.fastq.gz'

##Data2
Read2_1='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R1_001.fastq.gz'
Read2_2='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R2_001.fastq.gz'

##Data3
Read3_1="$workDir/D_1.fastq.gz"
Read3_2="$workDir/D_2.fastq.gz"

#Atria analysis
##Data3
atria \
-r "$Read3_1"   \
-R "$Read3_2" \
-o $workDir/atria_trimming \
--detect-adapter --no-quality-trim \
-t 3 --no-tail-n-trim --max-n=-1 --no-length-filtration

Error:

work directory is /media/geokit/Extreme SSD/Eugene
atria_trimming     D_1.fastq.gz  prtn-g.fasta  S1_S1_L001_R1_001.fastq  S2_S1_L001_R1_001.fastq.gz
atria_trimming.sh  D_2.fastq.gz  prtn-m.fasta  S1_S1_L001_R2_001.fastq  S2_S1_L001_R2_001.fastq.gz
too many arguments
usage: atria [-t INT] [--log2-chunk-size INDEX] [-f]
             -r R1-FASTQ [R1-FASTQ...] [-R [R2-FASTQ...]] [-o PATH]
             [-g AUTO|NO|GZ|GZIP|BZ2|BZIP2] [--check-identifier]
             [--detect-adapter] [-O PROCESS] [--polyG] [--polyT]
             [--polyA] [--polyC] [--poly-length POLY-LENGTH]
             [--poly-mismatch-per-16mer INT] [--no-adapter-trim]
             [-a SEQ [SEQ...]] [-A SEQ [SEQ...]] [-T INT] [-d INT]
             [-D INT] [-s INT] [--trim-score-pe FLOAT]
             [--trim-score-se FLOAT] [-l INT] [--stats]
             [--no-consensus] [--kmer-tolerance-consensus INT]
             [--min-ratio-mismatch FLOAT] [--overlap-score FLOAT]
             [--prob-diff FLOAT] [-b INT] [-B INT] [-e INT] [-E INT]
             [--no-quality-trim] [-q INT] [--quality-kmer INT]
             [--quality-format FORMAT] [--no-tail-n-trim] [-n INT]
             [--no-length-filtration] [--length-range INT:INT]
             [--enable-complexity-filtration] [--min-complexity FLOAT]
             [-p INT] [-C INT] [-c INT]
cihga39871 commented 7 months ago

Hi @EorgeKit ,

Thanks for your interest in Atria.

The workDir path contains a space, so you need to quote every use of $workDir in the script. -o $workDir/atria_trimming needs to be -o "$workDir/atria_trimming".
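The effect of the missing quotes can be reproduced without Atria. The sketch below is only an illustration: `count_args` is a hypothetical helper (not part of Atria) that reports how many arguments the shell passed to it, using the workDir value from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration: prints the number of arguments received.
count_args() { echo "$#"; }

workDir="/media/geokit/Extreme SSD/Eugene"

# Unquoted: the shell splits the expanded path at the space,
# so the command sees one extra argument.
unquoted=$(count_args -o $workDir/atria_trimming)

# Quoted: the path stays a single argument.
quoted=$(count_args -o "$workDir/atria_trimming")

echo "unquoted: $unquoted arguments"   # unquoted: 3 arguments
echo "quoted:   $quoted arguments"     # quoted:   2 arguments
```

With the unquoted form, atria receives "/media/geokit/Extreme" as the value of -o and "SSD/Eugene/atria_trimming" as a stray positional argument, which is presumably what triggers "too many arguments".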

EorgeKit commented 7 months ago

Thanks a lot. I have fixed that, but now I am getting a new error that I think might have to do with my FASTQ files. The trouble is, I have a hard time pinpointing the exact source of the problem, since the error is not fully descriptive as to where the problem originates. Also, this is happening for two paired-end FASTQ samples that were sequenced together, but it is not happening for another sample that was sequenced by Macrogen. Any advice?

code:

#!/usr/bin/env bash

#Data locations
workDir="/media/geokit/Extreme SSD/Eugene"
echo 'work directory is' $workDir
ls "$workDir"
##Data1
Read1_1='/media/geokit/Extreme SSD/Eugene/S1_S1_L001_R1_001.fastq.gz'
Read1_2='/media/geokit/Extreme SSD/Eugene/S1_S1_L001_R2_001.fastq.gz'

##Data2
Read2_1='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R1_001.fastq.gz'
Read2_2='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R2_001.fastq.gz'

##Data3
Read3_1="$workDir/D_1.fastq.gz"
Read3_2="$workDir/D_2.fastq.gz"

#Atria analysis
##Data2
atria \
-r "$Read2_1"   \
-R "$Read2_2" \
-o "$workDir/atria_trimming" \
--detect-adapter --no-quality-trim \
-t 3 --no-tail-n-trim --max-n=-1 --length-range 100:500 

error:

work directory is /media/geokit/Extreme SSD/Eugene
atria_trimming     D_2.fastq.gz  S1_S1_L001_R1_001.fastq       S2_S1_L001_R1_001.atria.log
atria_trimming.sh  prtn-g.fasta  S1_S1_L001_R2_001.fastq       S2_S1_L001_R1_001.fastq.gz
D_1.fastq.gz       prtn-m.fasta  S2_S1_L001_R1_001.atria.fastq.gz  S2_S1_L001_R2_001.fastq.gz
pigz 2.6
TaskFailedException

') at index 321 to BioSequences.DNAAlphabet{4}(). Is the input file valid? Does the disk have bad sections? The error is found in the following context:

    G

    Stacktrace:
     [1] error(s::String)
       @ Base ./error.jl:35
     [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::Vector{UInt8}, soff::UInt64)
       @ BioSequences ~/.julia/packages/BioSequences/QcYXq/src/longsequences/copying.jl:211
     [3] encode_chunk
       @ ~/.julia/packages/BioSequences/QcYXq/src/longsequences/copying.jl:225 [inlined]
     [4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::Vector{UInt8}, soff::UInt64, N::UInt64, #unused#::BioSequences.AsciiAlphabet)
       @ BioSequences ~/.julia/packages/BioSequences/QcYXq/src/longsequences/copying.jl:355
     [5] copyto!
       @ ~/.julia/packages/BioSequences/QcYXq/src/longsequences/copying.jl:292 [inlined]
     [6] safe_copyto!
       @ ~/atria/Atria/src/FqRecords/copy.jl:31 [inlined]
     [7] #StringChunk2FqRecord!#40
       @ ~/atria/Atria/src/FqRecords/thread_input.jl:916 [inlined]
     [8] (::Atria.FqRecords.var"#27#31"{Int64, NTuple{30, Vector{Atria.FqRecords.FqRecord}}, Int64})()
       @ Atria.FqRecords ./threadingconstructs.jl:258
cihga39871 commented 7 months ago

You need to check whether one of the FASTQ files contains an empty line followed by a line with a single G. If it does, the FASTQ file is not valid.

Eric



EorgeKit commented 7 months ago

Noted. I extracted roughly the first 10 reads from the file. On inspection there is no empty line anywhere, but it still throws the same error. Here is the file: subset.txt

cihga39871 commented 7 months ago

Can you run the following and show me the output?


Read2_1='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R1_001.fastq.gz'
Read2_2='/media/geokit/Extreme SSD/Eugene/S2_S1_L001_R2_001.fastq.gz'

zcat  "$Read2_1" | grep -E -n "^G$" 

zcat  "$Read2_2" | grep -E -n "^G$" 
EorgeKit commented 7 months ago

No standard output

ls -lhtr S2_S1_L001_R*
-rwxr-xr-x 1 geokit geokit 25M Jan 14  2023 S2_S1_L001_R1_001.fastq.gz
-rwxr-xr-x 1 geokit geokit 25M Jan 14  2023 S2_S1_L001_R2_001.fastq.gz

zcat S2_S1_L001_R1_001.fastq.gz | grep -E -n "^G$"
zcat S2_S1_L001_R2_001.fastq.gz | grep -E -n "^G$"
cihga39871 commented 7 months ago

It might be related to an unknown bug. Could you share the two gz files with me? Thanks.

Eric



EorgeKit commented 7 months ago

Absolutely. Here are the files: Data: S2_S1_L001_R1_001.fastq.gz

S2_S1_L001_R2_001.fastq.gz

cihga39871 commented 7 months ago

I found out why. The line breaks in your files are \r\n, but a FASTQ file's line break is usually \n.

The full error message was "Cannot encode byte 0x0d (char '\r') at index 321 to DNAAlphabet{4}.", but the characters before \r were not displayed, because in a Linux terminal \r means 'move to the start of the line' and the text after it overwrote them.

May I ask how you got those fastq.gz files? Did they come directly from a sequencer, or did someone process them before sending them to you?
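For anyone who hits the same truncated message, carriage returns can be checked for directly. This is a rough sketch, not part of Atria: `count_cr_lines` is a hypothetical helper, and the demo file holds a made-up read rather than data from this thread.

```shell
#!/usr/bin/env bash
# A literal carriage return, built portably with printf.
cr=$(printf '\r')

# Hypothetical helper: count the lines of a file that contain a carriage return.
count_cr_lines() { grep -c "$cr" "$1"; }

# Made-up single-read FASTQ with Windows-style (\r\n) line endings.
demo=$(mktemp)
printf '@read1\r\nACGT\r\n+\r\nIIII\r\n' > "$demo"

cr_lines=$(count_cr_lines "$demo")
echo "lines with CR: $cr_lines"

# od -c prints the otherwise invisible byte as \r.
od -c "$demo" | head -n 2

# In a terminal, text printed after \r overwrites the start of the line,
# which is how the beginning of the error message got hidden.
printf 'hidden prefix\rshown\n'

rm -f "$demo"

# On the real data, assuming the variables from the script in this thread:
# zcat "$Read2_1" | head -n 4 | od -c
```

A count above zero means the file has \r\n (or stray \r) line endings. It also explains why grep "^G$" found nothing even though the error context showed a lone G.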

cihga39871 commented 7 months ago

Currently, you can use zcat FASTQ | tr -d '\r' > NEW_FASTQ to remove '\r' in files.
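A self-contained sketch of that clean-up is shown here on a made-up one-read file. It assumes the output should stay gzip-compressed (the command above writes plain text) and uses gzip -dc in place of zcat, which is equivalent for .gz input; all file names are placeholders.

```shell
#!/usr/bin/env bash
cr=$(printf '\r')
tmp=$(mktemp -d)

# Tiny stand-in for a CRLF FASTQ file (made-up read, placeholder file names).
printf '@read1\r\nACGT\r\n+\r\nIIII\r\n' | gzip -c > "$tmp/in.fastq.gz"

# Same pipeline as suggested above, with the result re-compressed.
gzip -dc "$tmp/in.fastq.gz" | tr -d '\r' | gzip -c > "$tmp/clean.fastq.gz"

# Verify: no line should still contain a carriage return, and the
# record should still have its four lines.
# (grep -c exits non-zero when the count is 0, hence the "|| true".)
remaining=$(gzip -dc "$tmp/clean.fastq.gz" | grep -c "$cr" || true)
lines=$(gzip -dc "$tmp/clean.fastq.gz" | wc -l | tr -d ' ')

echo "lines: $lines, lines still containing CR: $remaining"

rm -rf "$tmp"
```

For paired-end data the same step would need to be applied to both the R1 and R2 files before running atria again.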

EorgeKit commented 7 months ago

I found out why. The line breaks in your files are \r\n, but a FASTQ file's line break is usually \n.

The full error message was "Cannot encode byte 0x0d (char '\r') at index 321 to DNAAlphabet{4}.", but the characters before \r were not displayed, because in a Linux terminal \r means 'move to the start of the line' and the text after it overwrote them.

May I ask how you got those fastq.gz files? Did they come directly from a sequencer, or did someone process them before sending them to you?

Well, this is interesting, haha. Supposedly these are the files that came back after someone demultiplexed them, so the \r issue was possibly introduced at that point.

EorgeKit commented 7 months ago

Currently, you can use zcat FASTQ | tr -d '\r' > NEW_FASTQ to remove '\r' in files.

Noted, let me get to it as soon as possible.

EorgeKit commented 7 months ago

So I implemented the suggestions and it finally worked. Thanks a lot for the assistance @cihga39871