BigelowLab / referee

Scripts for building reference databases
MIT License
0 stars 1 forks source link

A generalized path from binomial name to accession id - on the cheap #2

Open btupper opened 3 months ago

btupper commented 3 months ago

Given a list of binomial names, how can I get accession ids and sequences?

For each database in ['plant', 'vertebrate', 'invertebrate] do
  taxids = restez::list_db_ids() |>
    restez::gb_organism_get() |>
    some_local_reduction_function_goes_here() |>
    taxizedb::name2taxid()

Please, someone remind me which id taxid is? It's the accession id?

Also, note there is a function restez::gb_extract() that claims to extract accession ids.

I'm not sure what some_local_reduction_function_goes_here() does, but I gathered it reduces the length of the taxids list.

With the reduced taxids list we would then...

For each taxid in taxids do
  rec = restez::gb_record_get(taxid)
  accid = rec |>
    restez::gb_extract("accession") 
  seq = rec |>
    restez::gb_extract("sequence") 

Please feel free to edit this... it's mostly from memory with a few chicken scratches (I really should have been a doctor with this handwriting.)

btupper commented 3 months ago

FYI - taking just the first two steps with 100 records...

ids = restez::list_db_ids()
orgs = restez::gb_organism_get(ids)

x = dplyr::tibble(id = ids, org = orgs)

Yields the following... (one for each database)

> x = read_csv("/mnt/storage/data/edna/mednaTaxaRef/egrey/data/taxa/idorg.000_invertebrate_idorg.csv.gz", col_types = "c")
> x                                                                           
# A tibble: 100 × 2
   id       org                        
   <chr>    <chr>                      
 1 MT172431 Physarum plicatum          
 2 KT703092 Timema poppensis           
 3 MN134725 Timema poppensis           
 4 EU753220 Timema poppensis           
 5 KR257426 Timema poppensis           
 6 OD548947 Timema poppensis           
 7 KR877058 Timema poppensis           
 8 OE113011 Timema poppensis           
 9 AB763510 Timema poppensis           
10 LK622461 Spirometra erinaceieuropaei
# ℹ 90 more rows

So, now about that some_local_reduction_function_goes_here() process?

btupper commented 3 months ago

I attempted to merge the two databases (the species list and id-org dump). I did so by using a "pivoted" version of the id-org data so that org appears once and associated ids are concatenated into a single string...

               org,                               id 
"Timema poppensis", "KT703092 MN134725 ... AB763510"

I started by assigning all of Beth's entries into the "other" group. With luck these assignments will be chanced to "plant", "invertebrate", and "vertebrate".

I matched Beth's species list (binomial names): first by exact match (using match(beth, idorg)) and then by approximate matching (using agrep(beth, idorg)). For each iteration we only match for inputs that are still in the "other" group.

The output looks like this...

> x
# A tibble: 28,432 × 5
   SpeciesBinomial        group        exact org                    id                                                  
   <chr>                  <chr>        <lgl> <chr>                  <chr>                                               
 1 Actias luna            invertebrate TRUE  Actias luna            KC170540 KC172105 KC172397 KC172398 KC172399 KC1772…
 2 Notodonta scitipennis  invertebrate TRUE  Notodonta scitipennis  KC189982 KC189983 KC196635 KF152637 KF153107 KF1535…
 3 Egira dolosa           invertebrate TRUE  Egira dolosa           KC193646 KC193647 KC193648 KC193649 KC193650 KC1936…
 4 Prionus laticollis     invertebrate TRUE  Prionus laticollis     KR204111 KR204209 KR204387 KR204556 KR209768 KX2931…
 5 Haploa clymene         invertebrate TRUE  Haploa clymene         KF158688 KF160062 KF160099 KF160159 KF160207 KF1602…
 6 Melanoplus bivittatus  invertebrate TRUE  Melanoplus bivittatus  LC214962 LC214963 LC214964 LC362308 LC362775 LC3848…
 7 Pasiphila rectangulata invertebrate TRUE  Pasiphila rectangulata JX836068 JX836189 KC205783 KC205784 KC209635 KC2096…
 8 Libellula pulchella    invertebrate TRUE  Libellula pulchella    JF681014 JF692810 KM224501 KT138840 LC153303 LC2003…
 9 Panthea acronyctoides  invertebrate TRUE  Panthea acronyctoides  KC175531 KC175532 KC175533 KC175534 KC175535 KC1755…
10 Argiope aurantia       invertebrate TRUE  Argiope aurantia       AB310501 AB310537 AB324119 AB324148 AB379149 AB3791…
# ℹ 28,422 more rows
# ℹ Use `print(n = ...)` to see more rows

None of the examples of approximate matching are shown, but essential where there are one of more exact matches org and id will be concatenated into a single string or org1,org2,... and id group1 , id group2. A quick tally of the groupings...

> tab(x$group)
x
invertebrate        other        plant   vertebrate         <NA> 
        9720         9676         7249         1787            0 

So, we are still at 10k other.

btupper commented 3 months ago

@btupper add these to the local restez database...

#> • 7  - 'Other mammalian'
#>         349 files and 136 GB
#> • 8  - 'Rodent'
#>         343 files and 139 GB

And then rerun idorg_reducer using Beth's GNR resolve results using the species column.

btupper commented 2 months ago

Goodish news!

I added the "Other mammalian" and "Rodent" to our other restez databasii, and then reran the workflows for finding matches between the records in the databasii and Beth's super-cleaned species list. This is a smaller input than we started with last winter, but still plenty big with 18k records. The number of unassigned ("other") is now down to just over 1k.

> x = read_csv("/mnt/storage/data/edna/mednaTaxaRef/egrey/data/idorg/idorg_reducer.000-compact-merged.csv.gz")

#  A tibble: 18,039 × 5
#    species                group        exact org                    id          
#    <chr>                  <chr>        <lgl> <chr>                  <chr>       
#  1 Actias luna            invertebrate TRUE  Actias luna            KC170540 KC…
#  2 Notodonta scitipennis  invertebrate TRUE  Notodonta scitipennis  KC189982 KC…
#  3 Egira dolosa           invertebrate TRUE  Egira dolosa           KC193646 KC…
#  4 Prionus laticollis     invertebrate TRUE  Prionus laticollis     KR204111 KR…
#  5 Haploa clymene         invertebrate TRUE  Haploa clymene         KF158688 KF…
#  6 Melanoplus bivittatus  invertebrate TRUE  Melanoplus bivittatus  LC214962 LC…
#  7 Pasiphila rectangulata invertebrate TRUE  Pasiphila rectangulata JX836068 JX…
#  8 Libellula pulchella    invertebrate TRUE  Libellula pulchella    JF681014 JF…
#  9 Panthea acronyctoides  invertebrate TRUE  Panthea acronyctoides  KC175531 KC…
# 10 Argiope aurantia       invertebrate TRUE  Argiope aurantia       AB310501 AB…
# # ℹ 18,029 more rows
# ℹ Use `print(n = ...)` to see more rows
> count(x, group, exact) |> mutate(fraction = round(n, 3)/nrow(x)*100)
# A tibble: 9 × 4
#   group        exact     n fraction
#   <chr>        <lgl> <int>    <dbl>
# 1 invertebrate FALSE   106   0.588 
# 2 invertebrate TRUE   8542  47.4   
# 3 mammalian    TRUE     92   0.510 
# 4 other        NA     1007   5.58  
# 5 plant        FALSE    46   0.255 
# 6 plant        TRUE   6593  36.5   
# 7 rodent       TRUE      6   0.0333
# 8 vertebrate   FALSE     7   0.0388
# 9 vertebrate   TRUE   1640   9.09  

Note that all rodents and mammals were exact matches.

I have pulled out the records where exact is either NA or FALSE or group is "other" (which should have exact as NA anyway). It's only (!) 1166 records but perhaps reasonable for an expert to eyeball them? The file is attached.

@bethydavis might we add your workflow for generating the GNR-swept species list to this repo?

idorg_reducer.000-compact-merged-manual.csv.gz

btupper commented 2 months ago

At idorg_reducer and onward, use the long form (one id per row), include locus, and definition which can be found using gb_record_get() ... then use gb_extract(r, ...).

gb_extract(
  record,
  what = c("accession",  "organism", "sequence", "definition", "locus" )
)

Generate two files...

btupper commented 2 months ago

A short update... I set the above in motion for while I was on vacation, but it timed out after 200h. Ooops. I restarted it, and it is producing data like this (invertebrate example.)

Metadata

> glimpse(x)
Rows: 3,086
Columns: 11
$ accession_id <chr> "KM022481", "KM022938", "KM022939", "KM022940", "KM022941", "KM029989", "KM029990",…
$ organism     <chr> "Lygocoris pabulinus", "Notostira elongata", "Tingis ampliata", "Nysius helveticus"…
$ definition   <chr> "Lygocoris pabulinus voucher EUBUG_863_f_Lygopabu5 cytochrome oxidase subunit 1 (CO…
$ locus        <chr> "KM022481 658 DNA linear INV 21-SEP-2014", "KM022938 658 DNA linear INV 21-SEP-2014…
$ superkingdom <chr> "Eukaryota", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Eukaryota", "Eukaryot…
$ kingdom      <chr> "Metazoa", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Metazoa", "Metazoa", "N…
$ phylum       <chr> "Arthropoda", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Ctenophora", "Ctenop…
$ class        <chr> "Insecta", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Tentaculata", "Tentacul…
$ order        <chr> "Hemiptera", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Lobata", "Lobata", "N…
$ family       <chr> "Miridae", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Bolinopsidae", "Bolinop…
$ genus        <chr> "Lygocoris", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "Mnemiopsis", "Mnemiop…

Sequence data

> y
# A tibble: 3,124 × 2
   KM022481 TACTCTATACTTCATCTTTGGAATATGAGCAGGAATAGTAGGGACATCAATAAGATGAATTATCCGAATTGAACTAGGAATACCCGGGTCAT…¹
   <chr>    <chr>                                                                                         
 1 KM022938 AAGCCTATATTTTATGTTTGGGATATGAGCGGGGATATTAGGCACATCATTAAGATGAATTATTCGTATTGAATTAGGTATACCTGGCTCATT…
 2 KM022939 AACCTTATACTTCATCTTCGGTATATGAGCAGGTATACTAGGAACTTCAATAAGATGAATTATCCGTATTGAATTAAGCCAATCAGGGCCTTT…
 3 KM022940 AACTTTATATTTTATTTTCGGAATATGAGCAGGTATAATTGGCTCATCAATAAGATGGATTATTCGAATTGAACTAGGACAACCAGGTACATT…
 4 KM022941 AACCCTATATTTTTTATTTGGTATTTGAGCTGGTATATTAGGAACATCACTTAGTTGAATTATTCGTATCGAGTTAGGACATCCTGGATCCTT…
 5 KM029989 GATCAAACTATATGGCAATACAGAATTGTTAAATAAGCATTGTTATATAATGAAATGCACTCACACATACACGAGAGAGAGAGGCAGAGAAAG…
 6 KM029990 GGTCGAAGTAGTCTGGGTGTTAATCCCACCAGAATTTTAAGCGTATACCAAATTTCCGTTTTCCCAATTCCTTTTGTAATTGTATGCCTGAGC…
 7 KM029991 GATCATTTAACCAAGGAGGTTAGAAAGGGAGAGAGGGGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGTGTGTG…
 8 KM029992 GATCTCAAATTATGTTGACCACAAAGTTGCGGAAACACACACAGACACAGACACACAGACACACACACACACACACACACACACACACACACA…
 9 KM035379 AGCTTCTATAGTTATTTTTTCTTTACTAACAGTTATACCTTTTGGTGTTTTAATTTTACTTTATCTTTTTGGTAGTTTTTCAATATCTTCTAG…
10 KM035380 AGCCTCTATAGTTATTTTTTCTTTACTAACAGTTATACCTTTTGGTGTTTTAATTTTACTTTATCTTTTTGGTAGCTTTTCCATATCTTCTAG…
# ℹ 3,114 more rows
# ℹ Use `print(n = ...)` to see more rows

I can't noodle out why it is so slow, but I have decided to try chunking the searches in hopes that the local database runs more efficiently with blocks of requests rather than one-at-a-time.

In any event, I may not have any goodies in time for the Thursday meeting.

btupper commented 1 month ago

Hahahaha! I just ran mammals using a chunking approach to requests. It finished the mammals in 10 minutes. Ummmm, it's counterpart running one-at-a-time requests is in its 6th day and is less than half way through. Ooops. Maybe I will make the meeting deadline after all!