antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License
12 stars 4 forks source link

Custom DB Mappings #10

Closed Pkaps25 closed 2 years ago

Pkaps25 commented 3 years ago

Hello,

I'd like to make sure that I understand how Haystac's alignment works correctly with a custom database. I have a single genomes.fna file that contains reference genomes for Bacteria and HG19, including plasmids, strains, chromosomes, and phages. I built a haystac DB by splitting this large file into ~2000 smaller files by species. Thus, each file may have several fasta headers inside of it. I created a mapping file that species the species name, custom accession, and path to each of these species-level references. The accessions in the mapping file supplied to haystac database --sequences-file are simply those that appear in the first fasta header for that species. I would like to confirm that if a read is mapped to anywhere in one of these reference fastas, Haystac will be able to recognize it.

Thank you for your help.

Pkaps25 commented 3 years ago

Reopening because I'm still looking for clarification

antonisdim commented 3 years ago

Hello Peter,

Apologies for the delay and thank you for the detailed description !

Yes, if a fasta file contains multiple sequences for a certain species (such as bacterial chromosomes and plasmids as you mentioned), haystac will be able to recognise that an alignment is present for said species.

haystac creates a bowtie2genome index out of all the sequences present in a species fasta file, and then uses that index for all subsequent alignments. So a species can have a file with multiple accessions in it (and that is also how for example assemblies can be used for the building a database).

It should be noted that when building a database user specified sequences always take priority. For example if both --sequence-file and --query flags are active, and a species can be found in both of these options, it is the user specified sequence file that will be included in the database.

To make sure that a user provided sequence for a species will be identifiable as such by haystac, we ask users to provide a custom accession for each species. That custom accession is used to create a filename under which the user provided sequence will be stored in the cache directory.

Therefore a file can contain multiple sequences for a certain species, and all the sequences provided in that file will be used at the alignment stage, which in turn contributes to the posterior abundance calculation.

Please do let me know if I have not explained this clearly enough and thank you for your patience !

Best, Antony

Pkaps25 commented 3 years ago

Thank you Antony, this is very clear. As a follow up, is it an issue to include strains of a species in the same fasta file as the species? Could this lead to artifical circularization. Is it the case that each sequence in a fasta file gets treated independently?

antonisdim commented 3 years ago

Hello Peter,

I hope you are doing great !

To be honest that is something I had never though about until now.

If you would like to perform an identification on the strain level you could easily build a custom DB with strains by splitting your species into strains in your DB (please remember to name them appropriately though).

Answering directly to your question, each sequence in a fasta file is included in a bowtie2 index that corresponds to that file, so that means that if e.g. 10 sequences are in that file an index with 10 sequences will be built. I have not tested to see if your suggestion would lead to artificial circularization unfortunately.

What I could possibly suggest (even though I again have not tested that exact scenario) is to include all the strain genomes you are interested in separate fasta files under the species directory (as multiple fasta files can exist in a species directory). So that species will include all the strains you are interested in, your library will be aligned to each strain separately, and the final abundance for that species will take into account all the reads mapping to all the different strains. I would advise to use that solution a bit tentatively though, as I am not sure how this can affect reads being assigned to the grey matter.

Alternatively what I would suggest, if you would like to calculate the abundance of a species by taking into account all the strain variation, is to create a fasta file or files with the core and accessory genes of that species (sort of a pangenome fasta set of files). That should introduce the least amount of bias, if the rest of the species in your database only contain one genome.

Please let me know if that makes sense !

Best, Antony