bluenote-1577 / sylph

ultrafast taxonomic profiling and genome querying for metagenomic samples by abundance-corrected minhash.
MIT License
131 stars 6 forks source link

Building custom Database #14

Open Natalie-koks opened 2 months ago

Natalie-koks commented 2 months ago

Hello, Please I would like to know if I can create my own database with fungi included. Also, I am using RefSeq224 data files which are *.fna.gz will the command pick that instead of the fa.gz in your example? Thanks

bluenote-1577 commented 2 months ago

Hi @Natalie-koks,

Yes, .fna.gz also work. You can simply do `sylph sketch .fna.gz` to create your db.

Sylph doesn't support taxonomic annotations for fungi yet. See https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph#metaphlan-like-or-cami-like-outputs for how you could build your own metadata for creating a taxonomic profile.

Natalie-koks commented 2 months ago

Thank you very much.

bluenote-1577 commented 1 month ago

@Natalie-koks I've recently updated sylph to include taxonomic annotations for fungi and a new fungi database. See https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases#eukaryotic-databases and https://github.com/bluenote-1577/sylph-utils

humbleflowers commented 1 week ago

Hello @bluenote-1577

I created a custom database with around 20K reference fasta files from NCBI.

i want to create a taxonomy profile using sylph_to_taxprof.py script but it need metadata tsv?

Do you have a script to create metadata for custom database using taxonomy dump from ncbi?

bluenote-1577 commented 1 week ago

Hi @humbleflowers,

Take a look at https://github.com/ctb/2022-assembly-summary-to-lineages by Titus Brown.

From this snakemake workflow, here's what I did (I'll maybe write a tutorial on this later):

1) get the necessary assembly_summary_refseq/genbank.txt file. https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

2) Use the snakemake repo I linked. For each genome you care about, get the corresponding line in the assembly summary file as input to the Snakemake file (you'll have to call it example.assembly_summary.txt or something like that)

3) You get an 'example.lineages' file from the snakemake file.

4) Edit the script following to convert example.lineages to a metadata file:

def format_taxonomy(input_file, output_file):
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        # Read the header line and ignore it
        header = infile.readline().strip().split(',')

        for line in infile:
            fields = line.strip().split(',')

            ident = fields[0] + '_' + fields[1]
            superkingdom = f"d__{fields[2]}"
            phylum = f"p__{fields[3]}"
            class_ = f"c__{fields[4]}"
            order = f"o__{fields[5]}"
            family = f"f__{fields[6]}"
            genus = f"g__{fields[7]}"
            species = f"s__{fields[8]}"

            # Handle potential empty 'strain' field
            strain = fields[9] if len(fields) > 9 and fields[9] else species

            formatted_line = f"{ident}\t{superkingdom};{phylum};{class_};{order};{family};{genus};{species}"
            outfile.write(formatted_line + '\n')

# Example usage
input_file = './example.lineages.csv'
output_file = 'fungi_refseq_2024-07-25_metadata.tsv'
format_taxonomy(input_file, output_file)