ERROR: Calculation of read length statistics failed!

msnoon commented 1 year ago

Hi David, I am getting a similar error and I am sure all required tools are on the path. Could you help resolve this??

SequelTools.sh -t Q -v -u subFiles.txt
Beginning quality control function
Running in NO_SCRAPS mode
Extracting data from .bam files
Data extraction was sucessful
Beginning calculation of read length statistics
Traceback (most recent call last):
  File "/hdd_scratch1/msn/tools/SequelTools/Scripts/generateReadLenStats_noScraps.py", line 94, in <module>
    start = int(coord.split("")[0]); stop = int(coord.split("")[1])
ValueError: invalid literal for int() with base 10: 'ccs'
ERROR: Calculation of read length statistics failed!

aseetharam commented 1 year ago

Hi @msnoon:

It looks like you are using SequelTools on CCS reads. Unfortunately, it only works on subreads or CLR, not on CCS. It also works on the scraps files for either subreads or CLR.
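To illustrate why the script stops with `invalid literal for int() with base 10: 'ccs'`: PacBio subread/CLR names end in a `start_stop` coordinate pair, while CCS read names end in the literal word `ccs`, so there is nothing numeric to convert. The following is a minimal sketch of that difference, not the SequelTools code itself; the function name `parse_read_name` and the example read names are made up for illustration, and the `_` separator is an assumption about how the coordinates are split.

```python
def parse_read_name(name):
    """Split a PacBio-style read name into (movie, zmw, span).

    span is a (start, stop) tuple for subread/CLR names such as
    movie/zmw/start_stop, and None for CCS names such as movie/zmw/ccs.
    Hypothetical helper for illustration only.
    """
    parts = name.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected read name: {name!r}")
    movie, zmw, last = parts[0], parts[1], parts[2]
    if last == "ccs":
        # CCS reads carry no subread coordinates, so int() has nothing to parse.
        return movie, int(zmw), None
    start, stop = last.split("_")  # assumed separator between the coordinates
    return movie, int(zmw), (int(start), int(stop))
```

With made-up names, `parse_read_name("m54006_170101_000000/4194370/0_1500")` yields coordinates, while `parse_read_name("m54006_170101_000000/4194370/ccs")` returns `None` for the span instead of raising the error above.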

Thanks,

msnoon commented 1 year ago

We don't have CLR or scraps files; all we got is a CCS file. Do you know if there are any tools that could take CCS as input, or how I can get CLR files?

aseetharam commented 1 year ago

@msnoon: IMHO, there is no need to QC the CCS reads. They are already processed: reads whose base quality was poor or did not meet certain standards are excluded when the CCS reads are generated. If you want to calculate some stats such as the length distribution or total bases, you could use seqkit stats once you have converted your CCS reads to fasta/fastq format (using samtools fasta):

samtools fasta --threads 16 input_CCS.bam > output.fasta
seqkit stats *.fasta -a

Example output:

file               format  type  num_seqs    sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)
hairpin.fa.gz      FASTA   RNA     28,645  2,949,871       39      103    2,354   76   91  111        0  101       0       0
mature.fa.gz       FASTA   RNA     35,828    781,222       15     21.8       34   21   22   22        0   22       0       0
Illimina1.8.fq.gz  FASTQ   DNA     10,000  1,500,000      150      150      150  150  150  150        0  150   96.16   89.71
reads_1.fq.gz      FASTQ   DNA      2,500    567,516      226      227      229  227  227  227        0  227   91.24   86.62
reads_2.fq.gz      FASTQ   DNA      2,500    560,002      223      224      225  224  224  224        0  224   91.06   87.66
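If seqkit is not at hand, a few of the same columns (num_seqs, sum_len, min_len, avg_len, max_len, N50) can be approximated in plain Python. This is a rough sketch rather than a replacement for seqkit; the function names `read_lengths` and `length_stats` are hypothetical, and N50 is taken here as the length of the read at which the running total of lengths, sorted longest first, reaches half of all bases.

```python
def read_lengths(fasta_lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_stats(lengths):
    """Summarise read lengths: count, total bases, min, mean, max and N50."""
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(lengths)
    running = 0
    n50 = 0
    # Walk reads from longest to shortest until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": total / len(lengths),
        "max_len": max(lengths),
        "N50": n50,
    }
```

To use it on the file produced by samtools fasta, open output.fasta and pass the file handle to `read_lengths`, then pass the result to `length_stats`.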

Hope this helps!

Thanks,

msnoon commented 1 year ago

Thank you, Arun!!

ISUgenomics / SequelTools

ERROR: Calculation of read length statistics failed! #19