Maine-eDNA / mednaTaxaRef

Development of an R toolset to generate reference databases for use in Maine-eDNA sequence analyses based on merging existing functionalities

using local blast tools in lieu of entrez #12

Open btupper opened 5 months ago

btupper commented 5 months ago

I shared the local NCBI database search idea with Julia Brown at Bigelow. She thinks that we may be able to leverage the blastdbcmd to replace some of the entrez functionality we use now.

@egreyavis Perhaps we can set up a time to walk through the examples with an eye toward building some R wrappers.

btupper commented 5 months ago

@btupper try https://docs.ropensci.org/restez/articles/restez.html on charlie

btupper commented 5 months ago

@egreyavis which databases (databasii?) are of interest to the project?

btupper commented 5 months ago

So, I downloaded divisions 1, 2 and 6 ('Invertebrate', 'Plant (including fungi and algae)' and 'Other vertebrate') and then ran restez::db_create(). I get this error...

Inspecting 4322 file(s) to add to the database ...
... 'gbinv1.seq.gz' (1/4322)
... 'gbinv10.seq.gz' (2/4322)
... 'gbinv100.seq.gz' (3/4322)
... 'gbinv1000.seq.gz' (4/4322)
... 'gbinv1001.seq.gz' (5/4322)
... 'gbinv1002.seq.gz' (6/4322)
... 'gbinv1003.seq.gz' (7/4322)
... 'gbinv1004.seq.gz' (8/4322)
... 'gbinv1005.seq.gz' (9/4322)
... 'gbinv1006.seq.gz' (10/4322)
... 'gbinv1007.seq.gz' (11/4322)
... 'gbinv1008.seq.gz' (12/4322)
... 'gbinv1009.seq.gz' (13/4322)
... 'gbinv101.seq.gz' (14/4322)
... 'gbinv1010.seq.gz' (15/4322)
... 'gbinv1011.seq.gz' (16/4322)
... 'gbinv1012.seq.gz' (17/4322)
... 'gbinv1013.seq.gz' (18/4322)
... 'gbinv1014.seq.gz' (19/4322)
... 'gbinv1015.seq.gz' (20/4322)
... 'gbinv1016.seq.gz' (21/4322)
... 'gbinv1017.seq.gz' (22/4322)
... 'gbinv1018.seq.gz' (23/4322)
Error in paste0(lines[indexes], collapse = "\n") : 
  result would exceed 2^31-1 bytes
Calls: db_create ... gb_build -> flatfile_read -> lapply -> FUN -> paste0
In addition: There were 23 warnings (use warnings() to see them)

I think the joined text exceeds R's limit of 2^31-1 bytes for a single character string. I'll have to investigate. If needed, can we run the process on the 3 databasii separately and then merge the results?
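
For reference, here is a minimal sketch of the limit in question. It only illustrates the arithmetic behind the error message; `joined_bytes()` and `would_overflow()` are hypothetical helpers, not part of restez, and they estimate the size of `paste0(lines, collapse = "\n")` without actually building the string.

```r
# R caps a single character string at 2^31 - 1 bytes.
max_string_bytes <- .Machine$integer.max   # 2147483647

# Size in bytes of paste0(lines, collapse = "\n"), computed without
# building the joined string (hypothetical helper, not a restez function).
joined_bytes <- function(lines) {
  if (length(lines) == 0) return(0)
  # as.numeric() avoids integer overflow when summing many line lengths
  sum(as.numeric(nchar(lines, type = "bytes"))) + (length(lines) - 1)
}

# TRUE when joining the lines would exceed the limit.
would_overflow <- function(lines, limit = max_string_bytes) {
  joined_bytes(lines) > limit
}

would_overflow(c("ACGT", "TTGA"))             # small record, fits easily
would_overflow(c("ACGT", "TTGA"), limit = 8)  # 9 bytes joined, over a limit of 8
```

A check like this could in principle run before the paste step to flag oversized records up front, but where it would go inside restez is an open question.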

egreyavis commented 5 months ago

Yes we could run them separately and then merge.

btupper commented 5 months ago

OK - I'll set that up and see what happens

btupper commented 5 months ago

I'm thinking we should set the max_length argument to avoid that error. (It's not really an error but a limitation of character lengths in R - who knew one might want 2^31 characters in a sequence?) I'm not sure about downstream consequences, but I suspect that it would allow us to proceed. How about max_length = 10^9, which at 1 billion is a bit under half of 2^31 (about 2.15 billion)?

egreyavis commented 5 months ago

Sure that's fine!


btupper commented 5 months ago

I have tried a number of different max_length values (10^6, 10^15, 10^20, etc.) and I still encounter that error on occasion. I kicked off all three (invertebrates, plants and vertebrates) this morning, each using a YAML file similar to the following.

name: invertebrate
rootpath: /mnt/storage/data/edna/refdb/restez
download:
  preselection: "1"
  db: nucleotide
  overwrite: TRUE
  max_tries: 3
create:
  db_type: nucleotide
  max_length: 1e6
  min_length: 1

Invertebrates bailed early with that same error. Plants and vertebrates are still running.
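
A rough sketch of how a config like the one above could drive the restez calls, one division at a time. The config is written as an R list so the example stands alone; reading it from a file with yaml::read_yaml() should give the same shape. `build_one()` and its `run` flag are hypothetical names, and the working directory `file.path(rootpath, name)` is an assumption based on the paths used for this project.

```r
# Config mirroring the YAML above, as a plain R list.
cfg <- list(
  name = "invertebrate",
  rootpath = "/mnt/storage/data/edna/refdb/restez",
  download = list(preselection = "1", db = "nucleotide",
                  overwrite = TRUE, max_tries = 3),
  create = list(db_type = "nucleotide", max_length = 1e6, min_length = 1)
)

# Hypothetical wrapper: download and build one division from its config.
# With run = FALSE (the default) nothing is downloaded or built; the
# function only reports the directory it would use.
build_one <- function(cfg, run = FALSE) {
  path <- file.path(cfg$rootpath, cfg$name)
  if (run && requireNamespace("restez", quietly = TRUE)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
    restez::restez_path_set(path)               # restez works under <path>/restez
    do.call(restez::db_download, cfg$download)  # fetch the GenBank flat files
    do.call(restez::db_create, cfg$create)      # build the local database
  }
  invisible(path)
}

build_one(cfg)   # dry run: returns the per-division directory
```

Keeping one config and one directory per division means a failure in one build does not touch the others.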

btupper commented 5 months ago

Good news! The vertebrates database was successfully built.

Bad news! Plants joined invertebrates in failing to build.

btupper commented 5 months ago

I have cloned the restez package, and have added error trapping/handling to the bit of code that flops. I'll build the package on "charlie" and give it a whirl. If that resolves the issues (by skipping the big ones) then we are at least unblocked. I'll keep you posted.
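
The general shape of that kind of error trapping, as a sketch only. This is not the actual change made to the restez clone; `skip_on_error()`, `read_ok()` and `read_big()` are hypothetical stand-ins, and the failure is simulated with stop() rather than a real 2 GB string.

```r
# Run a reader function; on error, report it and return NULL so the
# caller can skip that input and carry on with the rest.
skip_on_error <- function(f, label) {
  tryCatch(
    f(),
    error = function(e) {
      message("skipping ", label, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-ins for reading two flat files: one fine, one too big.
read_ok  <- function() paste0(c("LOCUS A", "ORIGIN"), collapse = "\n")
read_big <- function() stop("result would exceed 2^31-1 bytes")

results <- list(
  a = skip_on_error(read_ok,  "gbinv1.seq.gz"),
  b = skip_on_error(read_big, "gbinv10.seq.gz")
)

# Drop the skipped entries before moving on.
kept <- Filter(Negate(is.null), results)
names(kept)
```

The trade-off is that anything skipped this way is silently absent from the database unless the messages are logged and reviewed.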

egreyavis commented 5 months ago

Thanks Ben.


btupper commented 4 months ago

Good news! Three databases (databasii?) downloaded and operational...

INFO [2024-04-23 12:19:49] db_name: invertebrate
INFO [2024-04-23 12:19:49] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez/downloads'
... Does path exist? 'Yes'
... N. files 2100
... Total size 261G
... GenBank division selections 'Invertebrate'
... GenBank Release 259
... Last updated '2024-04-08 10:32:31'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez/sql_db'
... Does path exist? 'Yes'
... Total size 404G
... Does the database have data? 'Yes'
... Number of sequences 1441629
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-23 05:58:35'
INFO [2024-04-23 12:19:49] db_name: other_vertebrate
INFO [2024-04-23 12:19:49] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez/downloads'
... Does path exist? 'Yes'
... N. files 510
... Total size 62.1G
... GenBank division selections 'Other vertebrate'
... GenBank Release 259
... Last updated '2024-04-08 12:10:53'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez/sql_db'
... Does path exist? 'Yes'
... Total size 50.7G
... Does the database have data? 'Yes'
... Number of sequences 832069
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-12 00:00:57'
INFO [2024-04-23 12:19:50] db_name: plant_with_fungi_algae
INFO [2024-04-23 12:19:50] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez/downloads'
... Does path exist? 'Yes'
... N. files 1714
... Total size 337G
... GenBank division selections 'Plant (including fungi and algae)'
... GenBank Release 259
... Last updated '2024-04-08 11:57:06'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez/sql_db'
... Does path exist? 'Yes'
... Total size 118G
... Does the database have data? 'Yes'
... Number of sequences 1345347
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-15 02:54:05'
INFO [2024-04-23 12:19:50] done!
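
A sketch of how a readiness check over the three databases might be looped, assuming each lives under its own directory as in the paths above. `check_dbs()` is a hypothetical helper; it uses restez::restez_path_set() and restez::restez_ready(), and returns NA where the directory is missing or restez is not installed. Whether an explicit connect step is also needed depends on the restez version, so treat this as untested against the project setup.

```r
db_names <- c("invertebrate", "other_vertebrate", "plant_with_fungi_algae")

# Hypothetical helper: TRUE/FALSE per database, NA if it cannot be checked.
check_dbs <- function(rootpath, names = db_names) {
  vapply(names, function(nm) {
    path <- file.path(rootpath, nm)
    can_check <- dir.exists(file.path(path, "restez")) &&
      requireNamespace("restez", quietly = TRUE)
    if (!can_check) return(NA)
    restez::restez_path_set(path)   # point restez at this division
    isTRUE(restez::restez_ready())  # does it have a usable database?
  }, logical(1))
}

# On the project server this would be called as:
# check_dbs("/mnt/storage/data/edna/refdb/restez")
```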

egreyavis commented 4 months ago

sweet!!
