Open btupper opened 3 months ago
FYI - taking just the first two steps with 100 records...
ids = restez::list_db_ids()
orgs = restez::gb_organism_get(ids)
x = dplyr::tibble(id = ids, org = orgs)
Yields the following... (one for each database)
> x = read_csv("/mnt/storage/data/edna/mednaTaxaRef/egrey/data/taxa/idorg.000_invertebrate_idorg.csv.gz", col_types = "c")
> x
# A tibble: 100 × 2
id org
<chr> <chr>
1 MT172431 Physarum plicatum
2 KT703092 Timema poppensis
3 MN134725 Timema poppensis
4 EU753220 Timema poppensis
5 KR257426 Timema poppensis
6 OD548947 Timema poppensis
7 KR877058 Timema poppensis
8 OE113011 Timema poppensis
9 AB763510 Timema poppensis
10 LK622461 Spirometra erinaceieuropaei
# ℹ 90 more rows
So, now about that some_local_reduction_function_goes_here()
process?
I attempted to merge the two databases (the species list and id-org dump). I did so by using a "pivoted" version of the id-org
data so that org appears once and associated ids are concatenated into a single string...
org, id
"Timema poppensis", "KT703092 MN134725 ... AB763510"
I started by assigning all of Beth's entries into the "other" group. With luck these assignments will be chanced to "plant", "invertebrate", and "vertebrate".
I matched Beth's species list (binomial names): first by exact match (using match(beth, idorg)
) and then by approximate matching (using agrep(beth, idorg)
). For each iteration we only match for inputs that are still in the "other" group.
The output looks like this...
> x
# A tibble: 28,432 × 5
SpeciesBinomial group exact org id
<chr> <chr> <lgl> <chr> <chr>
1 Actias luna invertebrate TRUE Actias luna KC170540 KC172105 KC172397 KC172398 KC172399 KC1772…
2 Notodonta scitipennis invertebrate TRUE Notodonta scitipennis KC189982 KC189983 KC196635 KF152637 KF153107 KF1535…
3 Egira dolosa invertebrate TRUE Egira dolosa KC193646 KC193647 KC193648 KC193649 KC193650 KC1936…
4 Prionus laticollis invertebrate TRUE Prionus laticollis KR204111 KR204209 KR204387 KR204556 KR209768 KX2931…
5 Haploa clymene invertebrate TRUE Haploa clymene KF158688 KF160062 KF160099 KF160159 KF160207 KF1602…
6 Melanoplus bivittatus invertebrate TRUE Melanoplus bivittatus LC214962 LC214963 LC214964 LC362308 LC362775 LC3848…
7 Pasiphila rectangulata invertebrate TRUE Pasiphila rectangulata JX836068 JX836189 KC205783 KC205784 KC209635 KC2096…
8 Libellula pulchella invertebrate TRUE Libellula pulchella JF681014 JF692810 KM224501 KT138840 LC153303 LC2003…
9 Panthea acronyctoides invertebrate TRUE Panthea acronyctoides KC175531 KC175532 KC175533 KC175534 KC175535 KC1755…
10 Argiope aurantia invertebrate TRUE Argiope aurantia AB310501 AB310537 AB324119 AB324148 AB379149 AB3791…
# ℹ 28,422 more rows
# ℹ Use `print(n = ...)` to see more rows
None of the examples of approximate matching are shown, but essential where there are one of more exact matches org
and id
will be concatenated into a single string or org1,org2,...
and id group1 , id group2
. A quick tally of the groupings...
> tab(x$group)
x
invertebrate other plant vertebrate <NA>
9720 9676 7249 1787 0
So, we are still at 10k other.
@btupper add these to the local restez database...
#> • 7 - 'Other mammalian'
#> 349 files and 136 GB
#> • 8 - 'Rodent'
#> 343 files and 139 GB
And then rerun idorg_reducer
using Beth's GNR resolve results using the species
column.
Goodish news!
I added the "Other mammalian" and "Rodent" to our other restez databasii, and then reran the workflows for finding matches between the records in the databasii and Beth's super-cleaned species list. This is a smaller input than we started with last winter, but still plenty big with 18k records. The number of unassigned ("other") is now down to just over 1k.
> x = read_csv("/mnt/storage/data/edna/mednaTaxaRef/egrey/data/idorg/idorg_reducer.000-compact-merged.csv.gz")
# A tibble: 18,039 × 5
# species group exact org id
# <chr> <chr> <lgl> <chr> <chr>
# 1 Actias luna invertebrate TRUE Actias luna KC170540 KC…
# 2 Notodonta scitipennis invertebrate TRUE Notodonta scitipennis KC189982 KC…
# 3 Egira dolosa invertebrate TRUE Egira dolosa KC193646 KC…
# 4 Prionus laticollis invertebrate TRUE Prionus laticollis KR204111 KR…
# 5 Haploa clymene invertebrate TRUE Haploa clymene KF158688 KF…
# 6 Melanoplus bivittatus invertebrate TRUE Melanoplus bivittatus LC214962 LC…
# 7 Pasiphila rectangulata invertebrate TRUE Pasiphila rectangulata JX836068 JX…
# 8 Libellula pulchella invertebrate TRUE Libellula pulchella JF681014 JF…
# 9 Panthea acronyctoides invertebrate TRUE Panthea acronyctoides KC175531 KC…
# 10 Argiope aurantia invertebrate TRUE Argiope aurantia AB310501 AB…
# # ℹ 18,029 more rows
# ℹ Use `print(n = ...)` to see more rows
> count(x, group, exact) |> mutate(fraction = round(n, 3)/nrow(x)*100)
# A tibble: 9 × 4
# group exact n fraction
# <chr> <lgl> <int> <dbl>
# 1 invertebrate FALSE 106 0.588
# 2 invertebrate TRUE 8542 47.4
# 3 mammalian TRUE 92 0.510
# 4 other NA 1007 5.58
# 5 plant FALSE 46 0.255
# 6 plant TRUE 6593 36.5
# 7 rodent TRUE 6 0.0333
# 8 vertebrate FALSE 7 0.0388
# 9 vertebrate TRUE 1640 9.09
Note that all rodents and mammals were exact matches.
I have pulled out the records where exact
is either NA
or FALSE
or group
is "other" (which should have exact
as NA
anyway). It's only (!) 1166 records but perhaps reasonable for an expert to eyeball them? The file is attached.
@bethydavis might we add your workflow for generating the GNR-swept species list to this repo?
At idorg_reducer
and onward, use the long form (one id per row), include locus, and definition which can be found using gb_record_get()
... then use gb_extract(r, ...)
.
gb_extract(
record,
what = c("accession", "organism", "sequence", "definition", "locus" )
)
Generate two files...
metadata accession_id
, organism
, definition
, concatenated_locus
, kingdom
, ..., species
sequence_data accession_id
, sequence
A short update... I set the above in motion for while I was on vacation, but it timed out after 200h. Ooops. I restarted it, and it is producing data like this (invertebrate example.)
> glimpse(x)
Rows: 3,086
Columns: 11
$ accession_id <chr> "KM022481", "KM022938", "KM022939", "KM022940", "KM022941", "KM029989", "KM029990",…
$ organism <chr> "Lygocoris pabulinus", "Notostira elongata", "Tingis ampliata", "Nysius helveticus"…
$ definition <chr> "Lygocoris pabulinus voucher EUBUG_863_f_Lygopabu5 cytochrome oxidase subunit 1 (CO…
$ locus <chr> "KM022481 658 DNA linear INV 21-SEP-2014", "KM022938 658 DNA linear INV 21-SEP-2014…
$ superkingdom <chr> "Eukaryota", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Eukaryota", "Eukaryot…
$ kingdom <chr> "Metazoa", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Metazoa", "Metazoa", "N…
$ phylum <chr> "Arthropoda", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Ctenophora", "Ctenop…
$ class <chr> "Insecta", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Tentaculata", "Tentacul…
$ order <chr> "Hemiptera", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Lobata", "Lobata", "N…
$ family <chr> "Miridae", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Bolinopsidae", "Bolinop…
$ genus <chr> "Lygocoris", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Mnemiopsis", "Mnemiop…
> y
# A tibble: 3,124 × 2
KM022481 TACTCTATACTTCATCTTTGGAATATGAGCAGGAATAGTAGGGACATCAATAAGATGAATTATCCGAATTGAACTAGGAATACCCGGGTCAT…¹
<chr> <chr>
1 KM022938 AAGCCTATATTTTATGTTTGGGATATGAGCGGGGATATTAGGCACATCATTAAGATGAATTATTCGTATTGAATTAGGTATACCTGGCTCATT…
2 KM022939 AACCTTATACTTCATCTTCGGTATATGAGCAGGTATACTAGGAACTTCAATAAGATGAATTATCCGTATTGAATTAAGCCAATCAGGGCCTTT…
3 KM022940 AACTTTATATTTTATTTTCGGAATATGAGCAGGTATAATTGGCTCATCAATAAGATGGATTATTCGAATTGAACTAGGACAACCAGGTACATT…
4 KM022941 AACCCTATATTTTTTATTTGGTATTTGAGCTGGTATATTAGGAACATCACTTAGTTGAATTATTCGTATCGAGTTAGGACATCCTGGATCCTT…
5 KM029989 GATCAAACTATATGGCAATACAGAATTGTTAAATAAGCATTGTTATATAATGAAATGCACTCACACATACACGAGAGAGAGAGGCAGAGAAAG…
6 KM029990 GGTCGAAGTAGTCTGGGTGTTAATCCCACCAGAATTTTAAGCGTATACCAAATTTCCGTTTTCCCAATTCCTTTTGTAATTGTATGCCTGAGC…
7 KM029991 GATCATTTAACCAAGGAGGTTAGAAAGGGAGAGAGGGGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGTGTGTG…
8 KM029992 GATCTCAAATTATGTTGACCACAAAGTTGCGGAAACACACACAGACACAGACACACAGACACACACACACACACACACACACACACACACACA…
9 KM035379 AGCTTCTATAGTTATTTTTTCTTTACTAACAGTTATACCTTTTGGTGTTTTAATTTTACTTTATCTTTTTGGTAGTTTTTCAATATCTTCTAG…
10 KM035380 AGCCTCTATAGTTATTTTTTCTTTACTAACAGTTATACCTTTTGGTGTTTTAATTTTACTTTATCTTTTTGGTAGCTTTTCCATATCTTCTAG…
# ℹ 3,114 more rows
# ℹ Use `print(n = ...)` to see more rows
I can't noodle out why it is so slow, but I have decided to try chunking the searches in hopes that the local database runs more efficiently with blocks of requests rather than one-at-a-time.
In any event, I may not have any goodies in time for the Thursday meeting.
Hahahaha! I just ran mammals using a chunking approach to requests. It finished the mammals in 10 minutes. Ummmm, it's counterpart running one-at-a-time requests is in its 6th day and is less than half way through. Ooops. Maybe I will make the meeting deadline after all!
Please, someone remind me which id
taxid
is? It's the accession id?Also, note there is a function
restez::gb_extract()
that claims to extract accession ids.I'm not sure what
some_local_reduction_function_goes_here()
does, but I gathered it reduces the length of thetaxids
list.With the reduced
taxids
list we would then...Please feel free to edit this... it's mostly from memory with a few chicken scratches (I really should have been a doctor with this handwriting.)