AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

EnsemblBacteria and EnsemblFungi species name is different... sometimes #3381

Open davidsmejia opened 9 months ago

davidsmejia commented 9 months ago

Context

Unable to create new bacteria transcriptome indices after merging in new mappings via https://github.com/AlexsLemonade/refinebio/pull/3366.

In order to create a new transcriptome index:

The main issue is how we do that: the species are present in their divisions, but their names do not match what our current approach, outlined below, expects.

Our current approach to determining the species name is as follows:

  1. concatenate [organism]_[strain] for the current organism
  2. iterate over the entire division and check if the species name attribute contains our concatenated mapping as a substring
    • if it does we change the name of the species to the concatenated value
    • otherwise we assume the species is not present
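
The steps above can be sketched roughly as follows. This is a minimal illustration, not the refinebio implementation: `find_species_by_name` and its parameters are hypothetical names, and it assumes each division entry is a dict whose `name` attribute is a lowercase, underscore-separated identifier.

```python
from typing import Dict, List, Optional


def find_species_by_name(
    organism: str, strain: str, division_species: List[Dict]
) -> Optional[Dict]:
    """Hypothetical sketch of the substring-based lookup described above."""
    # Step 1: concatenate [organism]_[strain] for the current organism.
    target = f"{organism}_{strain}".lower()

    # Step 2: scan the entire division for a name containing the target.
    for species in division_species:
        if target in species.get("name", ""):
            # Match: return a copy renamed to the concatenated value.
            return {**species, "name": target}

    # No match: the species is assumed not to be present in the division.
    return None
```

Because the lookup relies on a substring match, any difference in how the division spells the organism or strain makes the species appear to be missing.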

Problem or idea

The problem is that the names are inconsistent. Only 1 of the 4 bacteria and 0 of the 1 fungi could be found in their division using the current approach.

The lowest-effort fix would be to check against the assembly accession of each species. Below is what would have been matched for the 4 bacteria from the PR linked at the top of this issue. This looks correct to me, but I am unsure whether it is a reliable way to determine species, since I am only comparing assembly accessions.

This is what we get after filtering by assembly accession only. The first item below was the only one that was successful with the current approach.

[
  {
    serotype: null,
    assembly_ucsc: null,
    url_name: 'Streptococcus_pneumoniae_tigr4_gca_000006885',
    has_other_alignments: 1,
    species_taxonomy_id: 1313,
    has_variations: 0,
    assembly_default: 'ASM688v1',
    data_release_id: 73,
    assembly_id: 387,
    has_peptide_compara: 0,
    scientific_name: 'Streptococcus pneumoniae TIGR4 (GCA_000006885)',
    assembly_name: 'ASM688v1',
    taxonomy_id: 170187,
    website_packed: 0,
    has_synteny: 0,
    assembly_accession: 'GCA_000006885.1',
    reference: null,
    has_genome_alignments: 0,
    division: 'EnsemblBacteria',
    base_count: 2160842,
    display_name: 'Streptococcus pneumoniae TIGR4 (GCA_000006885)',
    organism_id: 51838,
    genebuild: '2014-05-TIGR',
    assembly_level: 'chromosome',
    has_pan_compara: 1,
    strain: null,
    has_microarray: 0,
    name: 'streptococcus_pneumoniae_tigr4_gca_000006885'
  },
  {
    reference: null,
    assembly_accession: 'GCA_000016305.1',
    scientific_name: 'Klebsiella pneumoniae subsp. pneumoniae MGH 78578 (GCA_000016305)',
    assembly_name: 'ASM1630v1',
    taxonomy_id: 272620,
    has_synteny: 0,
    website_packed: 0,
    has_other_alignments: 1,
    data_release_id: 73,
    assembly_default: 'ASM1630v1',
    assembly_id: 448,
    has_peptide_compara: 0,
    species_taxonomy_id: 573,
    has_variations: 0,
    url_name: 'Klebsiella_pneumoniae_subsp_pneumoniae_mgh_78578_gca_000016305',
    assembly_ucsc: null,
    serotype: null,
    strain: null,
    has_microarray: 0,
    name: 'klebsiella_pneumoniae_subsp_pneumoniae_mgh_78578_gca_000016305',
    assembly_level: 'chromosome',
    genebuild: '2016-06-TheKlebsiellapneumoniaeGenomeSequencingProject',
    has_pan_compara: 1,
    organism_id: 51919,
    display_name: 'Klebsiella pneumoniae subsp. pneumoniae MGH 78578 (GCA_000016305)',
    division: 'EnsemblBacteria',
    has_genome_alignments: 0,
    base_count: 5694894
  },
  {
    base_count: 4851126,
    division: 'EnsemblBacteria',
    has_genome_alignments: 0,
    organism_id: 51923,
    display_name: 'Stenotrophomonas maltophilia K279a (GCA_000072485)',
    has_pan_compara: 1,
    assembly_level: 'chromosome',
    genebuild: '2015-02-WellcomeTrustSangerInstitute',
    has_microarray: 0,
    name: 'stenotrophomonas_maltophilia_k279a_gca_000072485',
    strain: null,
    url_name: 'Stenotrophomonas_maltophilia_k279a_gca_000072485',
    assembly_ucsc: null,
    serotype: null,
    data_release_id: 73,
    assembly_default: 'ASM7248v1',
    assembly_id: 453,
    has_peptide_compara: 0,
    has_variations: 0,
    species_taxonomy_id: 40324,
    has_other_alignments: 1,
    taxonomy_id: 522373,
    assembly_name: 'ASM7248v1',
    has_synteny: 0,
    website_packed: 0,
    scientific_name: 'Stenotrophomonas maltophilia K279a (GCA_000072485)',
    reference: null,
    assembly_accession: 'GCA_000072485.1'
  },
  {
    url_name: 'Staphylococcus_aureus_subsp_aureus_usa300_fpr3757_gca_000013465',
    assembly_ucsc: null,
    serotype: null,
    has_peptide_compara: 0,
    assembly_default: 'ASM1346v1_',
    assembly_id: 102704,
    data_release_id: 73,
    has_variations: 0,
    species_taxonomy_id: 451515,
    has_other_alignments: 0,
    taxonomy_id: 451515,
    assembly_name: 'ASM1346v1',
    has_synteny: 0,
    website_packed: 0,
    scientific_name: 'Staphylococcus aureus subsp. aureus USA300_FPR3757 (GCA_000013465)',
    reference: null,
    assembly_accession: 'GCA_000013465.1',
    base_count: 2917469,
    division: 'EnsemblBacteria',
    has_genome_alignments: 0,
    organism_id: 80886,
    display_name: 'Staphylococcus aureus subsp. aureus USA300_FPR3757 (GCA_000013465)',
    has_pan_compara: 0,
    assembly_level: 'primary_assembly',
    genebuild: '2022-12-Prokka',
    name: 'staphylococcus_aureus_subsp_aureus_usa300_fpr3757_gca_000013465',
    has_microarray: 0,
    strain: null
  }
]

Solution or next step

tagging @jaclyn-taroni for approval

jaclyn-taroni commented 9 months ago

As discussed synchronously today, we should match on assembly. If the relationship is 1:1, continue with processing. If multiple genome builds are returned, throw an error.
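
A minimal sketch of that rule, assuming each division entry is a dict with an `assembly_accession` key as in the records above. The function name, the `AmbiguousAssemblyError` exception, and returning `None` when nothing matches are assumptions made for illustration; the comment above only specifies the 1:1 and multiple-match cases.

```python
from typing import Dict, List, Optional


class AmbiguousAssemblyError(Exception):
    """Hypothetical error for an accession that matches several genome builds."""


def find_species_by_assembly(
    assembly_accession: str, division_species: List[Dict]
) -> Optional[Dict]:
    # Collect every entry whose accession equals the requested one exactly,
    # including the version suffix (e.g. "GCA_000006885.1").
    matches = [
        species
        for species in division_species
        if species.get("assembly_accession") == assembly_accession
    ]

    # Multiple genome builds for one accession: stop rather than guess.
    if len(matches) > 1:
        names = [species.get("name") for species in matches]
        raise AmbiguousAssemblyError(
            f"{assembly_accession} matched {len(matches)} entries: {names}"
        )

    # Exactly one match: 1:1 relationship, safe to continue processing.
    # Zero matches: return None (an assumption; not specified in the thread).
    return matches[0] if matches else None
```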