Open cpauvert opened 3 years ago
With issue #18 in mind, I started working on this some time ago, and a few problems arose.
The search for the NCBI Tax ID failed for some participant names (e.g. row 1 in the table below), probably because of a trailing `sp.` or `spp.` (which seems easy to tackle).
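A minimal sketch of how the trailing `sp.` / `spp.` could be stripped before querying NCBI. It is written in Python purely for illustration. The helper name `strip_sp_suffix` and the regular expression are assumptions, not part of this repository or of any taxonomy package, and the sample names are taken from the table below.

```python
import re

# Hypothetical pattern: whitespace followed by "sp" or "spp", an optional dot,
# and nothing else up to the end of the string.
_TRAILING_SP = re.compile(r"\s+spp?\.?\s*$")


def strip_sp_suffix(name: str) -> str:
    """Return the name without a trailing 'sp.' or 'spp.' marker."""
    return _TRAILING_SP.sub("", name).strip()


# Sample participant names from the table in this issue.
names = [
    "Acanthamoeba spp.",
    "Azotobacter sp.",
    "Desulfosarcina sp.",
    "Acetobacterium woodii",  # no suffix, should be left untouched
]
cleaned = [strip_sp_suffix(n) for n in names]
```

This only handles the suffix problem. Fuzzy names such as "Ammonia-oxidizing bacteria" pass through unchanged and would still need separate treatment.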
Moreover, some participant names are fuzzy (e.g. row 5 or 10), which has two consequences:
I am still not sure how to deal with these problems, or exactly how to properly sanitize the names and Tax IDs without hand corrections.
|    | Participant_1 | Participant_2 | TR1 | TR2 |
|----|---------------|---------------|-----|-----|
| 1  | Acanthamoeba spp. | Candidatus Procabacter | species | genus |
| 2  | Acetobacterium woodii | Pelobacter acidigallici | species | species |
| 3  | Acinetobacter | Pseudomonas putida | genus | species |
| 4  | Alteromonas macleodii | Prochlorococcus | species | genus |
| 5  | Ammonia-oxidizing bacteria | Nitrite-oxidizing bacteria | class | class |
| 6  | Archaea (ANME-2) | Desulfosarcina sp. | phylum | genus |
| 7  | Aspergillus nidulans | Streptomyces rapamycinicus | species | species |
| 8  | Azotobacter sp. | Alternaria sp. | species | species |
| 9  | Bacillus sp. | Debaryomyces vanriji | species | species |
| 10 | Bacteroides ovatus | Bacteroides vulgatus and others | species | species |
This seems feasible if relying on the great `taxize` R package from rOpenSci. The following example: