Alignment of the reads - Githubissues

asan-emirsaleh commented 1 year ago

Hello! I have used QUAST several times and am now trying to use rnaQUAST. Both tools are described as great and robust quality assessment tools. Some things are not clear enough to me to use rnaQUAST effectively. Since ordinary alignment procedures take days to complete, it would be a good idea to prepare the alignment files before the pipeline starts and pass them as input. Is my understanding of the following correct?

-sam passes the alignment of the reads to the reference genome. Only the alignment data from the SAM file is used, not the read data. BAM format is not accepted. For reproducibility purposes, the STAR aligner with default parameters is used.
--left_reads and --right_reads pass the read data, so the reads are aligned to the assessed transcriptome by the STAR aligner. Currently there is no way to pass a previously prepared SAM file as input for this step. The read data is also aligned to the genome to compute mapping metrics. For this kind of analysis, the -sam parameter can be used to reduce the runtime.
--reference passes the reference genome. Currently there is no option to pass predicted transcriptome sequences.
--gtf passes the gene coordinates of the predicted transcripts in the reference genome. Both GTF and GFF files are accepted. This data is used by gffutils to build the gene database.
--gmap_index passes the index of the reference genome, which is used to align the transcriptome under assessment to the reference genome.
-psl passes the PSL file produced by aligning the transcriptome under assessment to the reference genome with the BLAT aligner. There is no option to pass a prebuilt BLAT database.
-meta is used to compute metrics specific to metatranscriptome assemblies, but this option is not documented in the manual page.
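
To make the workflow described above concrete (convert an existing BAM to SAM, then hand the precomputed files to rnaQUAST), here is a minimal sketch that only builds the command lines and does not run them. The option spellings -sam, -psl, --reference and --gtf are taken from the list above as I understand them; --transcripts, -o and all file names are assumptions, so check them against `rnaQUAST.py -h` before use.

```python
import shlex


def bam_to_sam_cmd(bam_path, sam_path):
    """samtools command converting BAM to SAM; -h keeps the header."""
    return ["samtools", "view", "-h", "-o", sam_path, bam_path]


def rnaquast_cmd(transcripts, reference, gtf, sam=None, psl=None,
                 out_dir="rnaquast_out"):
    """Assemble an rnaQUAST command line from precomputed inputs.

    --transcripts and -o are assumptions not mentioned in this thread;
    the remaining options are spelled as in the list above.
    """
    cmd = ["rnaQUAST.py",
           "--transcripts", transcripts,
           "--reference", reference,
           "--gtf", gtf,
           "-o", out_dir]
    if sam:
        cmd += ["-sam", sam]  # precomputed read-to-genome alignment
    if psl:
        cmd += ["-psl", psl]  # precomputed transcript-to-genome alignment
    return cmd


# Hypothetical file names, for illustration only.
print(shlex.join(bam_to_sam_cmd("reads.bam", "reads.sam")))
print(shlex.join(rnaquast_cmd("assembly.fasta", "genome.fa", "genes.gtf",
                              sam="reads.sam")))
```

Building the commands as argument lists keeps them easy to inspect and to pass to `subprocess.run` once the flags have been verified.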

One more thing is not clear to me. What is the BLAST aligner used for, and why are the BLAST databases built? Is there an option to pass a prebuilt one?

Best regards,
Asan

andrewprzh commented 1 year ago

Dear @asan-emirsaleh

Sorry for such a late response; rnaQUAST is now only occasionally maintained, as some of the authors have left the lab.

rnaQUAST uses GMAP to map contigs to the genome and BLAST to map contigs to the transcriptome. This allows it to detect misassemblies accurately, e.g. chimeric contigs reported by both methods. Unfortunately, there is no option to pass an existing BLAST database, since rnaQUAST creates the transcriptome FASTA itself based on the annotation. You can also send me the command line / log file to check.

I also suggest not computing database coverage by reads, as it was implemented quite inefficiently. If you would like to obtain gene counts, I'd suggest using e.g. STAR + featureCounts.
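
A rough sketch of what the STAR + featureCounts route could look like, written as a small helper that only builds the two command lines. The file names, the thread count and the output prefix are hypothetical, and a STAR genome index is assumed to exist already in `genome_dir`.

```python
import shlex


def star_cmd(genome_dir, left_reads, right_reads, prefix, threads=8):
    """STAR command aligning paired-end reads to a prebuilt genome index."""
    return ["STAR",
            "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", left_reads, right_reads,
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", prefix]


def star_bam_path(prefix):
    """Name STAR gives the coordinate-sorted BAM for a given prefix."""
    return prefix + "Aligned.sortedByCoord.out.bam"


def featurecounts_cmd(gtf, bam, out_table, threads=8, paired=True):
    """featureCounts command summarising reads per gene from a BAM file."""
    cmd = ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out_table]
    if paired:
        cmd.append("-p")  # input is paired-end
    cmd.append(bam)
    return cmd


# Hypothetical inputs, for illustration only.
prefix = "sample1_"
print(shlex.join(star_cmd("star_index", "reads_1.fq", "reads_2.fq", prefix)))
print(shlex.join(featurecounts_cmd("genes.gtf", star_bam_path(prefix),
                                   "gene_counts.txt")))
```

How paired-end fragments are counted depends on the featureCounts version, so it is worth checking the paired-end options in its documentation before relying on the counts.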

Best,
Andrey

asan-emirsaleh commented 1 year ago

Hi! Thank you for the response. As for BLAST, there is a reason for wanting to provide a prebuilt BLAST database. On some cluster setups such as ours, the newest BLAST release with a working makeblastdb is 2.9. Pinning BLAST to 2.9 forces a BUSCO downgrade in the conda environment: it is impossible to set both blast=2.9 and busco=5, because a version conflict error appears. Also, when data deposited in open databases is used as the reference, the putative transcriptome is often already known.

ablab / rnaquast

Alignment of the reads #17