NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License
19 stars 15 forks source link

get_ref_id sometimes doesn't take the best reference genome for the organism depending on the structure of the MaltExtract output #158

Open ZoePochon opened 3 months ago

ZoePochon commented 3 months ago

Here is a concrete example:

The function implemented in common.smk get_ref_id would extract the first reference genome suggested in the second line. However, this would be specific for the subspecies level in this case. And those references are just for some genes. Normally, it should better look (grep -w ^?) for the line beginning with "Variola_virus" only and then take the best reference genome, which would be here LR800247.1 on the fourth line, not the second line, so at the species level, not at the first subspecies level showing up.

Here is how the file {wildcards.sample}.trimmed.rma6_additionalNodeEntries.txt can look in certain contexts as an example: TargetNode 01 02 03 04 05 06 07 08 09 10 Variola_major_virus Variola_major_virus;_X65516.1;_TOPREFPERCREADS100 Variola_virus;_AF375125.1;_TOPREFPERCREADS100 Variola_virus;_DQ441428.1;_TOPREFPERCREADS100 Variola_virus;_DQ437581.1;_TOPREFPERCREADS100 Variola_virus;_AF375129.1;_TOPREFPERCREADS100 Variola_virus;_DQ441445.1;_TOPREFPERCREADS100 Variola_virus;_DQ441424.1;_TOPREFPERCREADS100 Variola_virus;_DQ437585.1;_TOPREFPERCREADS100 Variola_virus;_EF611223.1;_TOPREFPERCREADS100 Variola_virus;_KY358055.1;_TOPREFPERCREADS100 Variola_minor_virus root;_U18339.1;_TOPREFPERCREADS100 Variola_virus;_BK010317.1;_TOPREFPERCREADS100 Variola_minor_virus;_X72086.1;_TOPREFPERCREADS100 Variola_virus;_DQ441445.1;_TOPREFPERCREADS100 Variola_virus;_DQ437585.1;_TOPREFPERCREADS100 Variola_virus;_DQ441447.1;_TOPREFPERCREADS100 Variola_virus;_DQ437583.1;_TOPREFPERCREADS100 Variola_virus;_DQ441441.1;_TOPREFPERCREADS100 Variola_virus;_DQ437589.1;_TOPREFPERCREADS100 Variola_virus;_DQ441443.1;_TOPREFPERCREADS100 Variola_virus Variola_virus;_LR800247.1;_TOPREFPERCREADS100 root;_U18338.1;_TOPREFPERCREADS008 root;_X76267.1;_TOPREFPERCREADS007 root;_X76268.1;_TOPREFPERCREADS007 root;_X76265.1;_TOPREFPERCREADS005 root;_X76264.1;_TOPREFPERCREADS003 root;_X76266.1;_TOPREFPERCREADS003 root;_U18339.1;_TOPREFPERCREADS003 root;_X76263.1;_TOPREFPERCREADS002 root;_X76262.1;_TOPREFPERCREADS001

ZoePochon commented 2 months ago

We could use the node_list.txt information with the species name in it, replace the space with an underscore and grep all the files used by the pipeline located in the MaltExtract_output folder for the line with only Variola_virus instead of always taking the first line as a default This would mostly mean modifying the common.smk get_id rule and the authentic.smk as the rules in there use several files from the MaltExtract_output folder as input.