antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

Analyse outputting empty ts/tv file #7

Closed Pkaps25 closed 3 years ago

Pkaps25 commented 3 years ago

Hello

As far as I can tell, Haystac does not accept fasta files as sample inputs, only fastq. I have some fasta files I'd like to analyse, so I used BBMap reformat.sh and seqtk seq to convert them to fastq files with a dummy quality score of 40 for every base. The sequence headers are of the form @NC_XXXX-seqN/1, where N is the read number and XXXX are numbers belonging to an NCBI taxon. I have successfully built samples with these fastq files, but analysing them with --mode reads results in the errors in the attached log files. It appears that the ts_tv count files are all empty. Do you have any guidance on how to troubleshoot this?

Thank you

Attachments: ts_tv_log_2.txt, ts_tv_log.txt
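For readers who want to reproduce the conversion described above without BBMap or seqtk, here is a minimal Python sketch of the same idea: every FASTA record becomes a FASTQ record whose quality string is a constant Phred score of 40 (the character `I` in Phred+33 encoding). The function names `read_fasta` and `fasta_to_fastq` and the header `NC_0001-seq1/1` are hypothetical and used only for illustration; this is not part of Haystac.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(fasta_handle, fastq_handle, phred=40):
    """Write each FASTA record as FASTQ with a constant dummy quality.

    Returns the number of records written. The quality character uses the
    Phred+33 encoding, so phred=40 maps to 'I'.
    """
    qual_char = chr(phred + 33)
    count = 0
    for header, seq in read_fasta(fasta_handle):
        fastq_handle.write(f"@{header}\n{seq}\n+\n{qual_char * len(seq)}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical two-record input; the first sequence is wrapped over two lines.
    src = io.StringIO(">NC_0001-seq1/1\nACGT\nAC\n>NC_0001-seq2/1\nGG\n")
    out = io.StringIO()
    fasta_to_fastq(src, out)
    print(out.getvalue(), end="")
```

With real files, the same function can be called with handles from `open("reads.fasta")` and `open("reads.fastq", "w")` (file names assumed).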

Pkaps25 commented 3 years ago

@antonisdim I have resolved this issue. No reads were aligning to any taxon other than Dark Matter, which caused empty ts/tv files to be generated. I am working with fasta files; does Haystac use quality scores for anything other than the bowtie alignment? I am wondering whether the code could be modified minimally to support fasta.

antonisdim commented 3 years ago

Hello Peter,

I hope you are doing great, and apologies for the delayed response!

Indeed, we currently only support fastq files. I do not think it would be hard to integrate some support for fasta files in a future version of haystac. Of course, I'll keep you updated.

Thank you for your patience!

Best, Antony

Pkaps25 commented 3 years ago

Hi Antony,

Thank you for the response. Does Haystac use the quality scores for the abundance calculation or the Dirichlet read assignment? Based on the paper and the code I am leaning towards no, but I would like to confirm with you.

Thanks again!

antonisdim commented 3 years ago

Hello Peter,

No, after the first filtering alignment with bowtie2, base quality scores are not considered. The individual metagenomic alignments (with bowtie2) and the Dirichlet read assignment do not use the base quality information; instead, they rely on the edit distance of the reads.

Hope this helps, and please let me know if you have any other questions!

Best, Antony
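To illustrate the point about edit distance, the sketch below reads the `NM:i:` optional field that bowtie2 writes for aligned reads in its SAM output; that field holds the edit distance and does not depend on the quality string in column 11. Whether Haystac obtains the edit distance from this tag is an assumption made here for illustration, and the function name `edit_distance` and the SAM records are hypothetical.

```python
def edit_distance(sam_line):
    """Return the NM:i edit distance of a SAM record, or None if absent.

    SAM records have 11 mandatory tab-separated fields; optional tags such
    as NM:i:<N> follow from index 11 onwards. Unaligned reads carry no NM tag.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[len("NM:i:"):])
    return None


if __name__ == "__main__":
    # Two hypothetical records that differ only in the quality column
    # (index 10): dummy Phred 40 ('I') versus low quality ('#').
    high_q = "r1\t0\tNC_0001\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:-6\tNM:i:1"
    low_q = "r1\t0\tNC_0001\t100\t42\t4M\t*\t0\t0\tACGT\t####\tAS:i:-6\tNM:i:1"
    # The edit distance is read from the tag, so the quality string plays no role.
    print(edit_distance(high_q), edit_distance(low_q))
```

Under that assumption, a downstream step that only looks at the edit distance would treat fastq files with dummy qualities the same as files with real qualities, once the reads have passed the initial filtering alignment.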