To create a new transcriptome index, we:
1. Specify the `strain` and `assembly` we want for a particular `organism`.
2. Run a foreman command, passing in an `organism` name and a division.
3. In the command, pull down the entire division, which is a list of `species`.
4. Search the list for the specific species that matches our organism and strain.

The main issue is how we do the last step: the species are present in the division, but their names do not match what our current approach, outlined below, expects.
The current approach to determining the species name is as follows:
- Concatenate `[organism]_[strain]` for the current organism.
- Iterate over the entire division and check whether each species' `name` attribute contains the concatenated value as a substring.
- If it does, change the name of the species to the concatenated value.
- Otherwise, assume the species is not present.
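The name-matching steps above can be sketched roughly as follows. This is a minimal illustration, not the actual foreman code: the function name `find_species_by_name`, the shape of a species record (a dict with a `name` key), and the sample records are all assumptions made up for the example.

```python
from typing import Optional


def find_species_by_name(
    division: list[dict], organism: str, strain: str
) -> Optional[dict]:
    """Return the first species whose name contains "<organism>_<strain>".

    Hypothetical sketch: `division` is assumed to be a list of dicts,
    each with a "name" key.
    """
    mapped_name = f"{organism}_{strain}"
    for species in division:
        if mapped_name in species["name"]:
            # Rename a copy so the caller's division list is left untouched.
            return {**species, "name": mapped_name}
    # No substring match: the species is assumed to be absent.
    return None


# Made-up records, only to show how an inconsistent name defeats the check.
example_division = [
    {"name": "example_organism_abc1_gca_000000001"},
    {"name": "example_organism_str_xyz2"},
]

found = find_species_by_name(example_division, "example_organism", "abc1")
missed = find_species_by_name(example_division, "example_organism", "xyz2")
```

In this made-up data the second lookup returns `None` even though the strain is in the division, because `example_organism_xyz2` is not a substring of `example_organism_str_xyz2`. That is the kind of naming inconsistency this issue is about.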
Problem or idea
The problem is that the names are inconsistent. Only 1 of the 4 bacteria and 0 of the 1 fungi could be found in their division.
The lowest-effort way to fix this would be to check against the assembly in the species. Below is what would have been matched for the 4 bacteria referenced at the top of this issue. This looks correct to me, but I am unsure whether it is a reliable way to determine species, since I am only comparing assembly accession.
This is what we get after filtering by assembly accession only.
The first item below was the only one that was successful with the current approach.
As discussed synchronously today, we should match on assembly. If the relationship is 1:1, continue with processing. If multiple genome builds are returned, throw an error.
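A minimal sketch of that rule, assuming each species record is a dict with an `assembly_accession` key. The function name `find_species_by_assembly` and the key name are hypothetical, chosen for illustration. Returning `None` when nothing matches is also an assumption, since the discussion only covers the 1:1 and multiple-match cases.

```python
from typing import Optional


def find_species_by_assembly(
    division: list[dict], assembly_accession: str
) -> Optional[dict]:
    """Return the single species with the given assembly accession.

    Hypothetical sketch: raises ValueError if more than one genome build
    matches, and returns None if none do (the latter is an assumption).
    """
    matches = [
        species
        for species in division
        if species.get("assembly_accession") == assembly_accession
    ]
    if len(matches) > 1:
        names = ", ".join(s.get("name", "<unnamed>") for s in matches)
        raise ValueError(
            f"Expected one genome build for assembly {assembly_accession}, "
            f"found {len(matches)}: {names}"
        )
    # Exactly one match is the 1:1 case; an empty list means no match.
    return matches[0] if matches else None
```

Raising on multiple matches, rather than picking the first one, keeps an ambiguous mapping from silently producing an index for the wrong genome build.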
Context
Unable to create new bacteria transcriptome indices after merging in new mappings via https://github.com/AlexsLemonade/refinebio/pull/3366.
Solution or next step
tagging @jaclyn-taroni for approval