COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

Clarify usage for genome alignment #20

Open ssadedin opened 4 years ago

ssadedin commented 4 years ago

Hi,

Thanks for publishing pufferfish! I was interested in trying it for genome alignment, but when I tried to index GRCh38 it printed a lot of warnings, as if it were interpreting the input as a transcriptome:

pufferfish index -r Homo_sapiens_assembly38.fasta -o pufferfish
[2020-08-25 22:30:24.855] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers
[2020-08-25 22:30:34.352] [puff::index::jointLog] [warning] Entry with header [chr1] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:38.014] [puff::index::jointLog] [warning] Entry with header [chr2] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:41.037] [puff::index::jointLog] [warning] Entry with header [chr3] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:44.030] [puff::index::jointLog] [warning] Entry with header [chr4] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:46.990] [puff::index::jointLog] [warning] Entry with header [chr5] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.

Is there a different procedure for indexing a genome? Or are these warnings misleading?

NB: It would also be good to know whether alt contigs are handled in any way, or whether they should be removed from the reference to avoid multimapped alignments.

Thanks!

rob-p commented 4 years ago

@ssadedin,

Thanks for trying out puffaligner, and for your feedback! These warnings are indeed misleading. They appear because we have a unified codebase: the same pufferfish index that powers puffaligner also powers the selective alignment procedure used in our RNA quantification software, salmon. Salmon expects users to index a transcriptome, so the indexer warns that they may be indexing the wrong thing whenever an entry is longer than 200000 nucleotides. When aligning against a genome with puffaligner, indexing chromosomal contigs is of course normal, expected behavior. We will fix this on the back end so that these warnings are only issued when the indexer is invoked from salmon.
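Until that fix lands, the warnings can be filtered out of a saved log. The following is only a minimal sketch: the pattern is copied from the warning text in the log shown earlier in this thread, and the function name `drop_length_warnings` is a hypothetical helper, not part of pufferfish.

```python
import re

# Pattern based on the warning text in the log above; the exact wording
# may differ between pufferfish versions (assumption).
LENGTH_WARNING = re.compile(
    r"\[warning\] Entry with header \[.*\] was longer than \d+ nucleotides"
)


def drop_length_warnings(lines):
    """Yield every log line except the 'longer than N nucleotides' warnings."""
    for line in lines:
        if not LENGTH_WARNING.search(line):
            yield line


# Small demonstration on two lines taken from the log in this issue.
sample_log = [
    "[2020-08-25 22:30:24.855] [puff::index::jointLog] [info] Running fixFasta\n",
    "[2020-08-25 22:30:34.352] [puff::index::jointLog] [warning] Entry with "
    "header [chr1] was longer than 200000 nucleotides.  This is probably a "
    "chromosome instead of a transcript.\n",
]
for kept in drop_length_warnings(sample_log):
    print(kept, end="")
```

The function accepts any iterable of lines, so an open log file can be passed to it directly. Other warnings and all info messages pass through unchanged.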

Regarding alt contigs, you raise a good question. We have not extensively tested alignment to alt contigs. If you use the --bestStrata mode, puffaligner will look for all alignments that tie for the best score, and if the alt contig at a locus has the same sequence as the primary assembly, it will return both. When alignment scores are strictly tied, which alignment is marked as "primary" in the SAM record is essentially random. If there are use cases where alternative or custom behavior is likely to be preferred, we'd be happy to discuss!
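For anyone who would rather avoid these ties altogether, one option raised in the question is to remove the alt contigs from the reference before indexing. The following is only a sketch of that workaround, not a recommendation from the pufferfish authors. It assumes the alt contigs carry names ending in `_alt` (for example `chr1_KI270762v1_alt`), which should be checked against the headers of the FASTA actually in use. Decoy and HLA entries are not matched. The helper names `is_alt_header`, `drop_alt_contigs`, and `filter_fasta` are hypothetical.

```python
def is_alt_header(header: str) -> bool:
    """Return True if a FASTA header line names an alt contig.

    Assumption: alt contigs have sequence names ending in '_alt'.
    """
    tokens = header[1:].split()
    name = tokens[0] if tokens else ""  # sequence name = first token after '>'
    return name.endswith("_alt")


def drop_alt_contigs(fasta_lines):
    """Yield FASTA lines, skipping each alt contig's header and sequence."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = not is_alt_header(line)
        if keep:
            yield line


def filter_fasta(in_path: str, out_path: str) -> None:
    """Stream a FASTA file to a new file without the alt contigs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_alt_contigs(src))
```

The filter streams line by line, so it does not need to hold a whole chromosome in memory. The filtered file can then be passed to `pufferfish index -r` in place of the original reference.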

ssadedin commented 4 years ago

Thanks @rob-p - good to know about the warnings, and thanks for the info on the alt contig situation. I'll do some testing to see what the empirical behaviour is for the alt contigs, and will share any further thoughts in a separate issue if necessary.

Thanks!