ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

Export of Complete Collection Data for Migration, Backup #2051

Closed campmlc closed 1 year ago

campmlc commented 5 years ago

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Is your feature request related to a problem? Please describe. We have received repeated inquiries from potential new collections and existing collections as to whether collections data can be exported from Arctos for backup or migration to a different platform.

Describe the solution you'd like We currently allow export of flat file data as specimen search results through Arctos and DWC fields through external aggregators. Perhaps provide the option of a regular, automated export of these data, ftp'd to a particular server? Additionally, we could add in options for separate, linked downloads of transactions, projects, citations ? (by collection?), object tracking (show all objects in this container, flatten?)

Also explore the option of local Oracle backups by collection? or all Arctos?

Priority Please assign a priority-label.

campmlc commented 5 years ago

From @dustymc Sounds reasonable to me!

The "full" dump is computationally expensive and takes a lot of disk; I'm not sure if we could support running that for everyone at any reasonable interval with current resources.

loans + accessions + projects have all those problems too, but there aren't very many of them and they don't change very often - we could probably pull that off without much problem.

FLAT is cheap and easy to query (that's why it exists!) but is missing a lot of information - eg, it contains only one locality per specimen.

The DWC files contain full (except 'unaccepted') locality data, are in a Standard exchange format, and we could probably share more data than we do. I suspect that's our best bet for a "lightweight backup" but I'd need to know more about the purpose of the backup to make that call.

The Oracle backups contain everything, we're already paying the cost to make them, you can read them with free software, and I think TACC has essentially unlimited bandwidth. Scattering them across more disks in more locations under the control of more organizations would definitely make me sleep better. I think all I need is an address and write credentials to make that happen.

campmlc commented 5 years ago

The purpose of the backup would be to allow collections to maintain local flat file copies of their most critical data sufficient to recover the majority in case they decide to switch to a different platform or in case of catastrophic failure or downtime. Having this option would provide significant peace of mind to collections staff and admin, and would increase Arctos usability and marketability.

dustymc commented 5 years ago

I would need table/column detail to proceed; I can't know what anyone considers critical. (I know what I would consider critical: an Oracle backup file, which contains the rules and structure in addition to the data.)

I think a real-world use case would be very useful.

DWC data are here: http://ipt.vertnet.org:8080/ipt/resource?r=msb_mamm

1) What's missing? 2) Of that, what could we add?

diannakrejsa commented 5 years ago

Yes, I can speak to the desire for this as well as chief necessary fields. Are you looking for particular columns that are missing? Can you direct me to a single DWC extract sheet (where all the code table values are part of one spreadsheet)?

I've been creating a column-matching sheet for migrating our data from a MySQL extraction of our local Specify-derived server to how the data is originally entered (so we know we're extracting all necessary fields whilst using a DWC extract schema in Specify), then mapping that to Arctos fields. It will serve as a guide to the IT expert assisting in the migration. I can share that.

This may require a phone call to be most effective if I'm missing some information or not addressing what you're asking. If there is a DWC schema (the column headings) that I can look at, I can tell you what key elements are missing for our data purposes anyway.

dustymc commented 5 years ago

Are you looking for particular columns that are missing?

Yes.

single DWC extract sheet (where all the code table values are part of one spreadsheet)

I don't think such a thing exists; no specimen will have eg, all Attributes, and many specimens are spread across multiple DWC:Occurrences.

DWC schema

https://www.tdwg.org/standards/dwc/, but DynamicProperties make it somewhat like Arctos in that it's not limited to a spreadsheet-like structure.

diannakrejsa commented 5 years ago

Is there a way to code as part of an 'export all' function "export all Attributes"?

Here is the Column Matching sheet I mentioned. It starts by matching all fields that we have for mammal records extracted from the server database to what we enter in "flat sheet" data entry spreadsheets. Then those fields are mapped to how they must be entered into Arctos.

Column Matching V2.xlsx

dustymc commented 5 years ago

I'm not sure I'm understanding, but....

The Arctos specimen bulkloader is a greatly simplified view of the most common things shared among incoming specimens.

You can see all current Attribute types at http://arctos.database.museum/info/ctDocumentation.cfm?table=CTATTRIBUTE_TYPE.

I can certainly export attributes, the question is how. As rows in an attachment, no problem. As structured data in a cell, MAYBE they'll fit now, but that won't necessarily last - attributes can hold about 8K, any specimen can have any number of them, and various tools have various length constraints.

There are many other such data - eg, any specimen can have any number of parts, any part can have any number of part attributes.

Lacking better ideas, those would likely include eg, "determiner: John Smith." Determining if that's John Smith the expert or John Smith the dyslexic prankster needs a link back to Agents. In Arctos, an agent's old phone number (publications, relations to other entities, etc.) is very much a real part of "the attribute record." (Or in general everything is a part of everything else.) In an export, if you want any of those "ancillary" data I'll need to explicitly know about it.

The simplest model we've found that's capable of carrying the complexity of the data is the one we use. The only backup I'm aware of from which that complexity could be recovered is the native Oracle backup.
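For illustration only (table and column names here are hypothetical, not the actual Arctos DDL), the "structured data in a cell" option above amounts to something like the following, and it runs straight into the length limits described:

-- hedged sketch: concatenate all attribute determinations into one cell per specimen;
-- LISTAGG returns VARCHAR2, so a record with many determinations overflows ~4000 bytes
SELECT collection_object_id,
       LISTAGG(attribute_type || '=' || attribute_value, '; ')
         WITHIN GROUP (ORDER BY determined_date) AS attributes
  FROM attributes
 GROUP BY collection_object_id;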

campmlc commented 5 years ago

Dusty, I think in this case what we need is as follows, ordered top to bottom by what may be feasible:

1) A large flat file exactly like what we upload with the specimen bulkloader or get via the specimen results download. We can't download this ourselves because of browser timeout issues; otherwise we would. This would include all of the possible specimen bulkloader data fields = all fields available from add/remove data fields in specimen results. (Ideally, these fields would download only if there are data to populate them.) Attributes would have a Determined by Agent field, a Determined Date field, etc. Parts would include either a JSON string or, even better, be parsed out into columns as they would go into the bulkloader (including the barcode field). The event downloaded would by default be the most recent accepted event. The ID downloaded would be the most recent accepted ID. Agents would be preferred names; obviously, they could not contain any other info from the agents table in this format. Accessions would be an included column. Citations would be an included column. Can we get a column for loans added as a general concatenated field and also embedded into the parts JSON string? Is this possible?

2) Download data on multiple specimen events and ID history - how do we do this? As concatenated fields like the OTHER IDs? JSON? Multiple columns?

3) Download accessions and loans as list.

4) Figure out a way to download a separate flattened part-locations file for all items in the collection - this could obviously be monstrous, but would be immensely helpful to have as a periodic backup / archive.

5) Figure out a way to download the full part location tree in print format

6) Download the agents table into something that can be archived on local servers?

Anything I'm missing?


dustymc commented 5 years ago

exactly like what we upload with the specimen bulkloader or get via the specimen results download

Those are wildly different things. Eg, one will deal with 10 (or whatever the number is) parts with "core" part-components broken out, the other will deal with any number of parts (attributes, identifications, etc.) but with the complexity concatenated in various ways. I can certainly flatten various stuff into various formats (and much of that exists as FLAT), but I need specifics that address the reality of the data.

Attributes would have a Determined by Agent and Determined date field etc.

See above for the problems with merging them into structured data. I'm not sure how eg, 30 sex determinations might be munged into a spreadsheet - I suppose we could parse them out to sex_1 and sex_determiner_1 and such, but that would lead to a variable and indefinite number of columns.
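For illustration only (hypothetical table and column names), the numbered-column approach would look something like the sketch below, and the column list has to grow with the most-determined specimen:

-- hedged sketch: number each sex determination per specimen, then spread them across
-- sex_1/sex_determiner_1, sex_2/sex_determiner_2, ... columns
SELECT collection_object_id,
       MAX(CASE WHEN rn = 1 THEN attribute_value END) AS sex_1,
       MAX(CASE WHEN rn = 1 THEN determiner      END) AS sex_determiner_1,
       MAX(CASE WHEN rn = 2 THEN attribute_value END) AS sex_2,
       MAX(CASE WHEN rn = 2 THEN determiner      END) AS sex_determiner_2
  FROM (SELECT collection_object_id, attribute_value, determiner,
               ROW_NUMBER() OVER (PARTITION BY collection_object_id
                                  ORDER BY determined_date) rn
          FROM attributes
         WHERE attribute_type = 'sex')
 GROUP BY collection_object_id;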

Parts would include either JSON string

Not all fit within the current limitations of Oracle. That'll get better soon, but it's just a bump from 4KB to 32KB - some may still not fit.

even better, be parsed out into columns as they would go into the bulkloader

The specimen bulkloader can currently handle a ~dozen parts and no part attributes. The data can be many more parts, each with any number of attributes.

The ID downloaded would be the most recent accepted ID.

That one we can do! (As long as you don't care about taxon-stuff that won't fit into FLAT.)

column for loans added as a general concatenated field and also embedded into the parts JSON script? Is this possible?

That depends on what precisely you mean by "loans." If it's just a list of loan numbers or similar, probably. If you want more (loan data, results, involved parts, something for data loans, ....) then it likely won't easily fit.

Citations would be an included column.

That's available, but it links to Arctos so isn't very suitable for many of your reasons.

I can find a way around Oracle's datatype limitations (eg, write to files or CLOBs), but that would be computationally expensive (we can PROBABLY afford it), require a lot of disk, and I'm not sure what software would be capable of processing the results.

The purpose of the backup would be to allow collections to maintain local flat file copies of their most critical data sufficient to recover the majority in case they decide to switch to a different platform or in case of catastrophic failure or downtime. Having this option would provide significant peace of mind to collections staff and admin, and would increase Arctos usability and marketability.

This approach does not seem useful for that to me.

I don't think it's possible to flatten 'critical data' without significant loss (or perhaps significant liberties in defining "flat"!).

If I were going to migrate Arctos data to any other platform, I would want to start with an Oracle backup file. Absolute worst case, I could pay a consultant for a few days to get what I want from it, whatever that might be.

In the case of catastrophic failure, recovering from a fresh copy of the backups stored somewhere that wasn't affected by the fire/meteor/aliens/Texan Revolution (and the stuff on GitHub) would be trivial. Recovering from anything else would be torturous.

In the case of significant downtime, pulling Arctos up (eg, on some cloud service or at another .edu) from backups (plus github) would be technically trivial, and mostly impossible from anything else that I can imagine.

campmlc commented 5 years ago

I guess the question would be, how did you extract the Cornell data to repatriate to them to go back into Specify? That would be the easiest scenario, because they didn't do anything with their data. Or maybe we didn't give them anything back after all that work?

This request is largely an assurance to potential users that if they hate us, they can go back to something non-Oracle, something like what they came in with, which is largely a collection of csv files. Think of it as part of marketing - it may not be structurally necessary in your view, but it is necessary psychologically and sociologically to get people to feel comfortable in our environment. It also allows users to maintain local backups just in case. I think we all want the latter, for the old "Dusty gets hit by a bus" catastrophe scenario. Don't do that, by the way, at least not until we get more funding :)

In my original request, this - bulkload file format with "core" part-components broken out - would be what I ideally would want, without the part and attribute limits. If this is computationally not possible, then concatenation in a format that would allow it to be parsed out later into a csv file with "core components broken out" would be acceptable. So, the specimen results download with various types of concatenation would be an OK replacement, although not ideal (I hate having to try to parse JSON into csv - but maybe I just don't know how.)

Loans - concatenated list of loan numbers OK in flat. Plus a separate download as loan list info from transactions menu. Again, it would be ideal to have the parts download show loan relationships as well in some way- back to JSON?

Citations - OK even without the external links.

We accept that there will be loss of data in this format. But recovering some data is better than loss of all, which is what happens when Arctos, or even our local internet, goes down.

I would be happy to help go through field by field to decide on data concatenation etc if that is what it takes. Google spreadsheet?


dustymc commented 5 years ago

Cornell

The "full" dump (=tables)

something non-Oracle,

From an Oracle dump:

1) install Oracle Express (or fire up an account with some Oracle host, or go talk nice to your local financial people who almost certainly use Oracle, or...)
2) impdp {dumpfile}
3) use one of the hundreds of available tools or scripts to pull out whatever you want (including DDL, rules, relationships, datatypes, etc.) in whatever format you want it, and ignore what you don't want.

From anything else:

1) Try to figure out what you have
2) Hope there's enough information to do whatever you're trying to do
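As a hedged sketch of the dump route (schema, file, and table names are illustrative, and SET MARKUP CSV needs a 12.2+ SQL*Plus):

-- after impdp has loaded the dump into a local schema, any table can be spooled to CSV
SET MARKUP CSV ON
SPOOL msb_mamm_flat.csv
SELECT * FROM flat WHERE guid LIKE 'MSB:Mamm:%';
SPOOL OFF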

I don't think a flatfile "export" is a bad idea, but I do think it should come with some sort of explicit explanation of where it came from and what its limitations are.

more funding

Are you buying a bus?!

field by field

Yes, I think that's what it's going to take. Here's a sample of what's easiest to get to.

UAM@ARCTOS> create table temp_flatbits as select * from flat where guid like 'MSB:Mamm:%' and rownum<10000;

temp_flatbits.csv.zip

campmlc commented 5 years ago

Maybe an Oracle dump is good too, but how much expertise is required to clean up the output enough that a student could comprehend it or use it for subsequent data entry?

I like the idea of the flat file at least as a complementary approach. dkrejsa@angelo.edu, let's look over this flatbits file as a start.


dustymc commented 5 years ago

how much expertise is required to clean up the output enough that a student could comprehend it or use it for subsequent data entry

Nothing that hasn't been on stackoverflow a million times anyway, and the container describes the data.

The front-end is on github - it's not too hard to build a clone of Arctos from an Oracle dump and a git pull either.

I think it would just be a different type of expertise required to interpret a flatfile, assuming it contains what's needed to do whatever you'd be doing.

Here's another precompiled flat view of some data - this one will be much better at locality data, but doesn't contain any encumbered data. I'm not sure which one (if either) might be more useful.

UAM@ARCTOS> create table temp_dwc_bits as select * from digir_query.ipt_view where individualID like 'http://arctos.database.museum/guid/MSB:Mamm:%' and rownum < 10000;

temp_dwc_bits.csv.zip

diannakrejsa commented 5 years ago

temp_flatbits_missing values.xlsx

I've looked over both flatbits files. The second tab on the attached has the column headings transposed next to each other ("Comparison"). I looked at them with our data in mind and made additions in red at the bottom of the columns for what I think they're lacking or could benefit from. There are some fields that sound like they'd contain similar information -- INFORMATIONWITHHELD and ENCUMBRANCES, for example. Is there a way to query all the available fields but just export the ones with values? Or would there be too many redundancies and too much processing time?

campmlc commented 5 years ago

So if I understand this correctly, all the attribute information would be concatenated into a single field, without dates or determiners, correct? So ATTRIBUTE could be: "ATTRIBUTE: sex = male, age class = adult, reproductive info = scrotal, t = 4 x 2"? Is that correct? And would preparator number, collector number, etc. be in a similar concatenated field of OTHER IDs?

I agree with Dianna that it would be great if we could have a download of only those fields that are populated with data. I would also want all the fields that can be currently downloaded in the specimen results view using add/remove data fields to be options for download either as individual columns or as concatenated fields.


dustymc commented 5 years ago

I'm sorry, I don't think I've sufficiently explained some things.

Most of Arctos is normalized; it's data objects, not columns and fields. It's complex data of no particular depth, and it is not trivial to squish it into a simple form, like a spreadsheet. http://arctos.database.museum/guid/UAM:Fish:15064 has 40 "total length" determinations, for example. There are currently 362 Attributes, and any specimen can in theory carry any number of determinations for any number of Attributes; there could be literally thousands of the things, and many other kinds of data work the same way.

Table FLAT - used to build temp_flatbits above - is an attempt to flatten the most common things people search on for performance reasons. Core data tables are monitored by triggers, so when someone adds a new collector a single row in FLAT will be updated, and when someone changes a preferred agent name thousands of rows in FLAT may be updated. We can make some adjustments to that table, but attempting to eg, add a column for all possible combinations of Attributes would just be a huge mess less accessible than the primary data.

Table digir_query.ipt_view is a similar solution for a different audience - it's Occurrence-centric rather than specimen-centric. We can add anything in https://dwc.tdwg.org/terms/ (or wherever the current Standard is) to it, but again there are limited options for complex data.
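As a hedged illustration of that trigger maintenance (names are invented and the real Arctos triggers are more involved), a preferred-name change is the expensive case:

-- sketch only: a preferred-name change forces a rewrite of every FLAT row
-- that denormalizes that agent
CREATE OR REPLACE TRIGGER trg_agent_name_to_flat
  AFTER UPDATE OF preferred_name ON agent
  FOR EACH ROW
BEGIN
  UPDATE flat
     SET collectors = REPLACE(collectors, :OLD.preferred_name, :NEW.preferred_name)
   WHERE collectors LIKE '%' || :OLD.preferred_name || '%';
END;
/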

Some specific comments regarding the red text:

PREPARATOR

see also http://arctos.database.museum/info/ctDocumentation.cfm?table=CTCOLLECTOR_ROLE

PREP_NUMBER

http://arctos.database.museum/info/ctDocumentation.cfm?table=CTCOLL_OTHER_ID_TYPE&field=preparator%20number ??

AGE_CLASS (same as LIFESTAGE?)

Yes, for now - LIFESTAGE should probably include age and numeric age as well.

COLLECTOR (separate into individual agents if possible)

http://arctos.database.museum/guid/CHAS:Herp:1998.3.1 has 11 collectors; there is no real upper limit.

COLLECTORNUMBER

I need specifics to do this. There are currently 422 other ID types, and someone likely thinks each and every one of them is critical.

CREATED_BY_AGENT CREATED_DATE MODIFIED_BY_AGENT MODIFIED_DATE SECONDARY_IDENTIFIER ORIG_ELEV_UNITS ORIG_LAT_LONG_UNITS

Please note that FLAT contains only the "preferred" (https://github.com/ArctosDB/DDL/blob/master/functions/getPrioritySpecimenEvent.sql) locality.

RELATEDCATALOGEDITEMS same as ASSOCIATEDOCCURRENCES?

Not exactly - a single cataloged item may be parsed into many occurrences.

REMARKS ACCESSION ENCUMBRANCES same as INFORMATIONWITHHELD?

yes

AUTHOR_TEXT

this is in FLAT - note it is a concatenated string for complex IDs, and only exists for one ID

ISPUBLISHED

This is "secondary" data - it's just type_status (TRUE/FALSE). It's no problem to include it.

GUID Associated MEDIA/SEQUENCES (will link in media (IMAGEURL) and GenBank data?)

We have a DWC Media view as well - sample attached. It contains ONLY Media which is directly related to specimens.

temp_ipt_media.csv.zip

Reproductive Data Total Length Tail Length Hindfoot Length Ear Length Forearm Length Tragus Length Length Unit Weight Weight Unit

These are attributes; see above.

all the attribute information would be concatenated into a single field, without dates or determiners, correct?

UAM@ARCTOS> select attributes from flat where rownum<10;

ATTRIBUTES
------------------------------------------------------------------------------------------------------------------------
sex=male
sex=male
sex=female ; reproductive data=4plsc (3R,1L) ; total length=203 mm; tail length=102 mm; hind foot with claw=23 mm; ear from notch=16 mm
sex=male ; reproductive data=T=3 ; total length=187 mm; tail length=63 mm; hind foot with claw=23 mm; ear from notch=15 mm; weight=42.9 g
weight=19.0 g; sex=female ; reproductive data=no plsc ; total length=189 mm; tail length=96 mm; hind foot with claw=23 mm; ear from notch=16 mm
sex=female
sex=male ; reproductive data=T=3 ; total length=208 mm; tail length=105 mm; hind foot with claw=25 mm; ear from notch=17 mm; weight=33.0 g
weight=33.7 g; sex=male ; reproductive data=T=6X4 ; ear from notch=19 mm; total length=205 mm; tail length=103 mm; hind foot with claw=23 mm
sex=female ; weight=14.20 g; unformatted measurements=Verbatim weight=14.2g;ToL=114;TaL=28;HF=19;E=11; ; total length=114 mm; tail length=28 mm; hind foot with claw=19 mm; ear from notch=11 mm

9 rows selected.

preparator number, collector number etc

UAM@ARCTOS> select OTHERCATALOGNUMBERS from flat where OTHERCATALOGNUMBERS is not null and rownum<10;

OTHERCATALOGNUMBERS
------------------------------------------------------------------------------------------------------------------------
AF=29849
AF=29821
AF=34816; original identifier=RWF 586
AF=34766; original identifier=RWF 518
AF=34548; GenBank=AY529434
AF=15986; original identifier=RWF 406
AF=34295; original identifier=RWF 418
AF=34781; original identifier=RWF 496
AF=26596

9 rows selected.

download of only those fields that are populated with data

I could remove NULL "columns" after building a table or something, but I think most developers would agree that getting different headers for every iteration is not much fun. Definitely "minor details" in any case; we can do whatever without much problem.

all the fields that can be currently downloaded in the specimen results view using add/remove data fields to be options for download either as individual columns or as concatenated fields.

That is technically trivial, and might go a long way towards addressing much of the above. Unfortunately I think it comes at a cost we are unable to pay with our current resources. As a super quick-n-dirty point of reference, catnum12 (76 records) takes .04 seconds with the basic options and 6.8 seconds with everything turned on.

Can we find more resources to address this?

Here's what's included.

catnum12.csv.zip

campmlc commented 5 years ago

I think this is what we need:

all the fields that can be currently downloaded in the specimen results view using add/remove data fields to be options for download either as individual columns or as concatenated fields.

It sounds like we need more resources - what exactly does that mean? Processors? How do we go about this? Talk to TACC? Write a grant? Ask for member contributions?


dustymc commented 5 years ago

I don't know exactly what I'm asking for. "Here's some money, make it go faster" worked exceptionally well once. I think starting with an Oracle performance consultant (after we're in 12, which has some native parallel capabilities that may do something significant for this) probably makes the most sense. There are lots of things we could do to greatly improve performance, but I suspect they all require many more resources than we have. Eg, if I had a blank check I'd either hire an engineer to build a RAC dedicated to query and a full-time DBA or two to take care of it, or move to some cloud service and let them worry about that stuff. I doubt if that's remotely feasible.

In general, TACC is really into processing massive amounts of data and this seems to be something they'd be interested in. I'd think they'd have someone there who could at least point us in the right direction, but I haven't been able to locate them.

Oracle 12 has bigger datatypes and more JSON capabilities; we might be able to do more of this as cached data in FLAT, but at the cost of more overhead/maintenance/disk. It's not a general-purpose solution, but it still might address some common problems.

Jegelewicz commented 5 years ago

What do we need to get Oracle 12? I have been feeling like we need to focus on this upgrade and getting a test resource back. Do we need to contact Oracle directly or is this a TACC discussion?

dustymc commented 5 years ago

TACC. There's progress - I logged in yesterday, but still no successful data transfer.

campmlc commented 5 years ago

Is it an administrative, financial, or technical issue?


campmlc commented 5 years ago

Here is the start of a list of data fields we would want in a basic export of collection data for local backup, based on MSB Mammal data fields. I've mapped the single record data entry headers and tables to the bulkload headers. There are some other items we will need - accession and loan info, part attributes - but these may need to be exported separately. Different collection types will have different data fields - e.g. there are no geology attributes here, or collecting event attributes. https://docs.google.com/spreadsheets/d/1HwD8-UWOi1wG7x71NZbaVdQz9ULscSLBVh7_vVBsIE4/edit?usp=sharing

campmlc commented 5 years ago

Elevating to critical based on discussions at the SPNHC 2019 meeting.

Jegelewicz commented 5 years ago

Also see https://github.com/ArctosDB/new-collections/issues/50#issuecomment-505631740

What Mariel heard at SPNHC 2019 was "The most important question to ask when considering which collection management software to use is: How easy is it to get my data out?"

dustymc commented 5 years ago

The most important question to ask when considering which collection management software to use is: How easy is it to get my data out?

Is that a Symbiota ad campaign?! "It's so easy to get your data out, it's like it was never really in an actual CMS! 😉... 😢"

Getting your data, and the data which describes it and are necessary to reassemble it, out of Arctos is easy - we do it every day.

Getting some deformed version of bits and pieces of your data - that is, not your data - is much less trivial, and from that it is, at best, difficult to avoid the Specify "wiggle something and we'll see what moves over here" drill.

I remain baffled.

diannakrejsa commented 5 years ago

What I can provide in this discussion is that I've been scarred from the difficulty of the Specify data extraction. I assume this is what everyone at SPNHC has gone through or heard about at some point as well.

I realize there is a lot of complexity that can be added in Arctos that will not be extractable, but for the basic input data I still want a recoverable flat file. That's why, before migration, I am trying to generate a flat file of all information about a record in a single row before I move data. And until, and probably after, Arctos has an export of the attributes and parts we developed and Mariel put in the Google Doc above (and that won't crash a browser window during export), I'll probably not let anyone use the Arctos browser form for ASNHC data entry and instead bulkload everything. I want those long-form flat file inputs as backup anyway.

dustymc commented 5 years ago

scarred from the difficulty of the Specify data extraction

We should form a support group!

There is an entire family of languages devoted to avoiding this situation. (Specify doesn't use it very well, which just makes things so much worse.) That's essentially the basis of my confusion - powerful tools exist, we fully use them, and this looks like a request to avoid them and create the situation that led some very smart people to create powerful tools to avoid the situation we seem to be trying to create?!?!?!

Those languages, and the data formats capable of carrying them, do in fact make all of the complexity of Arctos accessible to anything that can understand the language. DDL documents data containers so that the situation you're facing does not come to exist.

not let anyone use the Arctos browser form for ASNHC data entry and instead bulkload everything

That is an approachable problem with an easy solution. However you get it there - and there's no limit on that, it's built to be flexible and accessible - specimen data ultimately goes through the same channel to be parsed out to "real Arctos." I just happen to have an archive package inserted in that channel - I have a copy of everything that's ever been deleted from the bulkloader, which generally happens as data successfully loads. I'd be happy to share that with you in some way. It's large-ish and cumbersome - I seldom need access and it's optimized for write - but it's just a table so it should be easy enough to pass around.

diannakrejsa commented 5 years ago

That seems like a good solution potentially! Making the tables from the bulkloader exportable/accessible for those with this export concern.

Jegelewicz commented 5 years ago
The most important question to ask when considering which collection management software to use is: How easy is it to get my data out?

Is that a Symbiota ad campaign?! "It's so easy to get your data out, it's like it was never really in an actual CMS! 😉... 😢"

Actually, in a way, yes - Deb Paul is convinced that Symbiota is the answer (although she would never admit that). I infer this because near the very end of our meeting she told the people from BRIT that "Symbiota is the most interoperable solution". Ugh.

scarred from the difficulty of the Specify data extraction

We should form a support group!

We already have! It meets here whenever anyone needs help coping.

I just happen to have an archive package inserted in that channel - I have a copy of everything that's ever been deleted from the bulkloader, which generally happens as data successfully loads.

WHAAAAAAA?! That seems nuts, but if we have space to store it, then it's a back-up of sorts (although I assume you also have deletes from things that were NOT successfully loaded? I do this a lot as I am cleaning data....)

I'll probably not let anyone use the Arctos browser form for ASNHC data entry and instead bulkload everything. I want those long-form flat file inputs as backup anyway.

Honestly, I did this at UTEP just because it was easier for the students to use Excel than the single record data entry form and I could review and make corrections before they made it in at all. It's a personal preference and I think it's great that we have the choice!

dustymc commented 5 years ago

space

It's a big table, but barely noticeable in the grand scheme of things - about 3.6GB with essentially no overhead. It solves a huge number of "Arctos is haunted!!!" issues (turns out Arctos probably isn't really haunted....) so it's very worth the costs to me. We should probably purge older data, but it's not become a concern so I haven't. We could also do something fancier with that sort of data, but again it's not causing problems and I've got bigger fish to fry so here we are.

Yes it's just a trigger so it contains everything. It's not perfect, but most everything (I think everything that someone with superpowers didn't have a hand in) will have NULL loaded if it made it through and "DELETED" loaded if it didn't - shouldn't be too hard to separate out the stuff that got removed along the way.
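A hedged sketch of that kind of archive trigger (table and column names are invented, not the actual implementation):

-- sketch only: copy each bulkloader row into an archive table as it is deleted,
-- whether the delete came from a successful load or from manual cleanup
CREATE OR REPLACE TRIGGER trg_archive_bulkloader_delete
  BEFORE DELETE ON bulkloader
  FOR EACH ROW
BEGIN
  INSERT INTO bulkloader_archive (guid_prefix, other_id_number, loaded, archived_date)
  VALUES (:OLD.guid_prefix, :OLD.other_id_number, :OLD.loaded, SYSDATE);
END;
/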

review and make corrections

You can also download things entered through the data entry form.

choice

Yes, there was never really meant to be a "THE" data entry form - we imagined people would write their own specialized apps and such, and the system has always been built to accommodate that. Maybe someday....

[Screenshot: Screen Shot 2019-06-26 at 9 14 21 AM]

That is QnD access. It won't currently pull anything older than ~5 years or so - whenever we added the VPD layer to the bulkloader. It's easy enough to get around that, but it should probably be its own issue as it will likely inadvertently involve unrelated collections.

campmlc commented 5 years ago

Will this include edits made through the specimen detail page? I make edits this way all the time.

I had hoped we could just select fields from the specimen results page (add/remove data fields) and have that download (in coherent order, as described) sent to us on a regular interval via ftp or some other means to bypass the browser limitations. As Dusty says, we download Arctos specimen data daily, but we are limited by how many records and fields we can download because of the browser. It would be very easy and intuitive to be able to download all records as displayed in specimen results as a flat file with the fields of our choosing.


dustymc commented 5 years ago

edits made through the specimen detail page?

No, it's a bulkloader archive.

browser

That isn't the limitation, or at least not one that bothers me.

  1. Making the tables is computationally expensive - I don't think we currently have the CPU to support something like all specimen results options for a collection. Easy enough from my perspective, I just need more resources.
  2. The result is not something from which it's possible to re-create the data from which it was created. That's fine, as long as we don't pretend it's anything else. (The title of this Issue makes me think we do in fact want to pretend this is something it can't be. Maybe I'm reading too much into that?)
campmlc commented 5 years ago

I am happy with a flat file that could be used to re-enter data into any platform. I believe Dianna is requesting the same. While I also think it is a good idea to have an Arctos backup at an institution with an Oracle license, I'm not talking about re-creating Arctos from the backup. This is just one (or more) flat files that can be used as a digital or hard copy backup of the specimen data, period. How many specimen results can we extract at a time? Maybe we do this in multiple files?


dustymc commented 5 years ago

"Symbiota is the most interoperable solution"

Well, it is an exchange-standard wrapper, and interoperability is why exchange standards exist; they're not wrong. That's also precisely the reason exchange standards cannot be good management platforms.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3253084/

AFAIK everything that can be in Symbiota can be exported from Arctos via DWC. If "what you can do in Symbiota" is close enough to "your data" then this isn't an issue.

flat file that could be used to re-enter data

There can be no such thing. If we could reasonably represent Arctos data in a simpler structure, we'd have done so.

good idea to have an Arctos backup at...

Give me some space and the credentials to access it and I'll make that happen.

... an institution with an Oracle license

I'm not sure why this is relevant.

(or more)

Tables as CSV is possible (probably also with more resources). That cannot include DDL; it's purposefully placing yourself where Dianna is now.

This should probably be reprioritized and added to the AWG agenda; I think we're somehow typing around each other.

dustymc commented 4 years ago

Dropping critical tag; this is not breaking existing functionality.

dustymc commented 4 years ago

From https://github.com/ArctosDB/internal/issues/50#issuecomment-571765625

automated local backups of flat files equivalent to what goes into the specimen bulkloader.

Possible with some work, but it won't be complete except by accident - eg, I load "bare" records then immediately bulkload in the 50 parts (or parts with attributes, or ....) that won't fit in the specimen bulkloader more often than not.

cleaner, more locally useful and accessible copy.

"Incomplete" seems the appropriate word. If we could get at the information contained in Arctos in a simpler format, we would.

see edit history

There's an Issue somewhere, the big question is "what's an edit?"

How possible is this with PG?

PG seems to be very good at making CSV. Pushing the complexity of Arctos to a single flat file is anything but trivial, but if you can define something that works for you then I can probably build it.

Finding the resources to support some complex CSV-ification should be simpler (at least more possible) under PG, but that's something ya'll will need to work out with TACC.
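For example (a hedged sketch; it assumes the flat table and GUID format carry over to PG, and that the server account can write to the target path - client-side \copy in psql avoids that requirement):

-- emit a per-collection CSV directly from PG
COPY (SELECT * FROM flat WHERE guid LIKE 'MSB:Mamm:%')
  TO '/backups/msb_mamm_flat.csv' WITH (FORMAT csv, HEADER);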

Yes, I believe we all realize that a csv backup would be incomplete, but we would like to have some sort of local backup regardless. Ideally, we could have a standard template with most of the major who/what/when/where/how fields, but the type and extent of the data downloaded could be customized by the collections. We could have separate archive templates for specimen data, accessions, and loans. We could decide on the core data that we feel would be necessary if we lost everything and had to repopulate a database from the csv files.

dustymc commented 4 years ago

separate archive templates for specimen data, accessions, and loans.

I'd just need structure, contents, and priority to get started.

Priority is the most critical part, I think - who would find value in such a thing?

Next, to me, is WHY do you find value in such a thing? Can we do something about our actual infrastructure, or your perceptions of our infrastructure, or somehow focus our energies on something that could ACTUALLY be used in a disaster recovery situation?

Jegelewicz commented 4 years ago

focus our energies on something that could ACTUALLY be used in a disaster recovery situation?

I think this is important. If we lose connectivity, either wholesale or a single institution, we need a way to continue daily operations (search, loan, documenting stuff) that can then be fed to Arctos when connectivity is restored. Definitely not a quick fix, but something we should be figuring out.

dustymc commented 4 years ago

that can then be fed to Arctos when connectivity is restored.

Depending on what exactly you mean, that's somewhere between magic and keeping a copy of the various bulkloader templates - I think everything can be entered into Arctos via CSV, and if not that's pretty simple to address.

A local DB backup would be useful in disaster recovery, and could be used (in conjunction with stuff from GitHub, which could also be backed up locally) to put up a clone. Definitely not an afternoon project (although perhaps not much more than that if PG works as intended, and minus reports and background services in the short term). Repatriating things that happen there to a different instance is likely plausible (depending on how many instances are being combined and what's happened in them while apart), but far from trivial.
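
As a hedged sketch of what "put up a clone from a local backup" might involve under PG - all hosts, users, database names, and paths below are placeholders, not Arctos's actual configuration:

```python
# Hedged sketch: take a custom-format pg_dump of the production database and
# restore it into a local PostgreSQL instance as a clone. Hosts, users,
# database names, and paths are placeholders, not Arctos's real configuration.
import subprocess

DUMP_FILE = "/backups/arctos.dump"

def take_dump():
    subprocess.run(
        ["pg_dump", "-Fc",                       # custom format, readable by pg_restore
         "-h", "db.example.edu", "-U", "backup",
         "-d", "arctos", "-f", DUMP_FILE],
        check=True)

def restore_clone(local_db="arctos_clone"):
    subprocess.run(["createdb", local_db], check=True)   # assumes local PG admin rights
    subprocess.run(["pg_restore", "-d", local_db, DUMP_FILE], check=True)

if __name__ == "__main__":
    take_dump()
    restore_clone()
```

Reports, background services, and anything living in CF/Lucee would still have to be stood up separately.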

Cloud services exist in no small part to do what you describe; Google probably kills a few hundred servers every day without notice. Moving Arctos to a cloud service is very much a possibility, and would just require funding. (It was something like $12K/month on Amazon when we looked at this a few years back.)

I believe TACC has big pipelines to various other .edus, and given appropriate resources might be able to feed hot backup(s) through them.

Maintaining single-instance connectivity is about the same wherever Arctos lives - it's almost certainly vastly cheaper to grab a few cell hotspots (or maybe to lay fiber!) than to replicate infrastructure locally.

It's still not clear to me what the objectives of this are.

Jegelewicz commented 4 years ago

It's still not clear to me what the objectives of this are.

For me it is just being able to work when "Arctos is down", first of all just finding stuff that someone visiting the collection has asked for. If everything is barcoded, then nothing can be found without the data....

dustymc commented 4 years ago

finding...barcoded

A specimenresults download including partdetail contains that, and I think it'll download for a collection without much problem in PG. (I've got everything ripped apart and can't confirm right now.)

campmlc commented 4 years ago

The problem with partdetail and barcodes is the difficulty of searching/extracting from JSON into Excel. I think I asked about this before - is there some sort of extraction tool? For me, being able to work locally, or to switch seamlessly to an alternate server or mirror when Arctos goes down, is high priority. I imagine it would be very eye-opening to see how many person-hours are idled when we have an outage. We should definitely investigate funding. Another option might be to revamp Arctos Air for this purpose.

Also, the ability to extract one's data from a database in CSV format, for backup or to move to a new platform, is a constant, repeated question that comes up at meetings and in interactions with folks interested in Arctos. I believe this will end up being best practice once such a thing is developed; it is already on iDigBio's list. If we want to be competitive with other systems, we need this feature. It is not just a question of psychological comfort or of learning to trust the machine.

dustymc commented 4 years ago

extraction tool

All kinds of things will talk to JSON - the tool isn't so much the problem, it's the structure; what do you DO with, say, 10 parts, each having a half-dozen associated containers and perhaps another half-dozen associated attributes, each with their own structure? That sort of thing (particularly when the structure is potentially different for every record) isn't trivial to boil down to cells in a spreadsheet.
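
To make the structural problem concrete, here is one possible (and necessarily lossy) way to flatten a partdetail-style JSON document into one spreadsheet row per part; the keys and example values are invented for illustration and are not the actual Arctos JSON structure:

```python
# Illustrative only: flatten nested "parts" JSON into one spreadsheet row per part.
# The keys (parts, part_name, containers, attributes, ...) are hypothetical,
# not the actual Arctos partdetail structure; the flattening is deliberately lossy.
import csv
import json

record = json.loads("""{
  "guid": "MSB:Mamm:12345",
  "parts": [
    {"part_name": "skull", "barcode": "B001",
     "containers": ["box 7", "shelf 3"],
     "attributes": [{"type": "condition", "value": "good"}]},
    {"part_name": "skin", "barcode": "B002",
     "containers": ["cabinet 12"],
     "attributes": []}
  ]
}""")

rows = []
for part in record["parts"]:
    rows.append({
        "guid": record["guid"],
        "part": part["part_name"],
        "barcode": part.get("barcode", ""),
        # nested lists get squashed into single delimited cells - structure is lost
        "containers": "; ".join(part["containers"]),
        "attributes": "; ".join(f"{a['type']}={a['value']}" for a in part["attributes"]),
    })

with open("parts_flat.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

Every choice here - one row per part, semicolon-joined containers and attributes - throws away structure, which is exactly why the target has to be defined before code gets written.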

csv format

CSV is trivial (particularly under postgres), but not as any sort of comprehensive view of a bunch of tables smooshed into a 'document.' You will need to define the target before I (or anyone else) can write code to generate such a thing.

iDigBio

They deal with something that is designed to be spreadsheet-ish. If the data iDigBio can deal with serves this purpose, they are and always have been available at vertnet.

learning to trust the machine.

The paranoia is appreciated - I'm fairly sure I trust machines less than anyone here - it's what we do about it that's in question. Big-picture, I think there are two possibilities, and they don't have to be mutually exclusive, although depending on the details they might need more resources of some sort.

1) Something that can be used to recover the database. That's trivial, maybe more trivial (even as text) under PG; a rough sketch follows below. You could potentially open it in Excel, but it would be as a hundred-ish pages (and a couple of them would have a hundred million or so rows, which might be problematic).

2) Something that's more Excel-oriented. As above the big problem from my perspective is defining the structure; I'd need to know what precisely you want to do with eg n parts, where n can be anywhere from 0 to "lots" (currently 125) and each can have lots of 'metadata.' This would almost certainly need to come with the understanding that creation is a lossy process - it would likely not be useful for anything that might be termed 'recovery.'
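
A rough sketch of option (1) as a script - one CSV per base table, enumerated from information_schema. The connection string and schema name are placeholders, and the output contains no DDL, constraints, or application logic:

```python
# Rough sketch of option (1): dump every base table in a schema to its own CSV,
# driven off information_schema. The DSN and schema are placeholders; the output
# includes no DDL, constraints, or code-table logic.
import os
import psycopg2

DSN = "dbname=arctos user=backup host=localhost"  # placeholder connection string
OUT_DIR = "table_dumps"

def dump_all_tables(schema="public"):
    os.makedirs(OUT_DIR, exist_ok=True)
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
                (schema,))
            tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            out_path = os.path.join(OUT_DIR, f"{table}.csv")
            with conn.cursor() as cur, open(out_path, "w", newline="") as f:
                # one CSV per table; the biggest ones will not open in Excel
                cur.copy_expert(
                    f'COPY "{schema}"."{table}" TO STDOUT WITH CSV HEADER', f)

if __name__ == "__main__":
    dump_all_tables()
```

The couple of hundred-million-row tables would still come out as CSVs you probably couldn't open in Excel, which is the practical limit of option (1) as a spreadsheet.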

Jegelewicz commented 4 years ago

Also, the ability to extract one's data from a database in csv format, for backup or to move to a new platform, is a constant, repeated question that comes up at meetings and in interactions with folks interested in Arctos.

The iDigBio people tell EVERYONE that the most important question to ask of any DB provider is "how difficult is it to get my data out?"

dustymc commented 4 years ago

I can't completely disagree with that, but there's some complexity. These all seem somewhat realistic situations to me:

  1. Use an exchange standard as a database - your data are never really "in" so it's easy enough to get them out! Not something I'd ever consider, but it's incredibly common.
  2. Consider whatever's in an exchange standard "your data" and all the rest of the junk you might find in a system like Arctos "metadata" - in that case you can get "your data" from VertNet anytime you want them. This might be a realistic approach to more unrealistic situations - if somehow TACC loses all backups, you'll still have "core" data (which you can type in after rebuilding civilization!)
  3. Consider everything in Arctos "your data," in which case you need an equally-sophisticated container to hold them. I would LOVE to make that easy for someone by pushing to a mirror, hot backup, off-site backup, or whatever they're willing to build. This is absolutely what I'd do as a CM, if I could find the resources to pull it off.
  4. Do something else, like consider "random stuff that'll fit in Excel" as "your data" - totally possible, I just need a well-defined container (to start). It's not really clear to me how this might be better than (2), perhaps after we spend some time thinking about what we're sharing via DWC (which should be beneficial to everyone - https://github.com/ArctosDB/arctos/issues/2141).

And FWIW iDigBio was also apparently telling people that SQL won't scale past two million rows, which is something like arguing that French won't scale past 500 words....

campmlc commented 4 years ago

With the move to PG eliminating the requirement for a site license, how hard would it be, and how much would it cost, to have and maintain a mirror that could kick in if Arctos goes down? I still advocate for the customized CSV backup with richer data than Darwin Core as a backup strategy, in addition to the above. We just need to hash out which fields to include and how to implement the local data transfer, ideally on an automated schedule.

campmlc commented 4 years ago

Also, can we maintain Arctos Air as our local copy, to use when the internet is out?

dustymc commented 4 years ago

mirror that could kick in if Arctos goes down?

https://www.postgresql.org/docs/12/supported-platforms.html - PG doesn't need much; making it perform well takes more. You would need a CF license to handle the background stuff and reports, but hopefully that won't survive SABI (and if we're bringing production up off-site we could transfer the license).

We can talk about details after we have structure, but a CSV push is unlikely to be a problem.
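
For what it's worth, a "CSV push" to collection-controlled disk could be as simple as the following sketch, assuming a nightly export already lands in a local directory and the SSH/rsync destination below (a placeholder) is writable:

```python
# Sketch of an automated "CSV push": assumes a nightly export already lands in
# EXPORT_DIR and that the SSH/rsync destination (a placeholder) is writable.
import subprocess
from datetime import date

EXPORT_DIR = "/backups/csv"                                 # wherever the export lands
DEST = "curator@collections.example.edu:/archive/arctos"    # placeholder destination

def push_backups():
    stamp = date.today().isoformat()
    subprocess.run(
        ["rsync", "-az",                # -a preserves metadata, -z compresses in transit
         EXPORT_DIR + "/", f"{DEST}/{stamp}/"],
        check=True)

if __name__ == "__main__":
    push_backups()
```

Something like this could run from cron on whatever schedule a collection wants; the hard part remains agreeing on what goes into the export, not moving it.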

DWC could be a lot richer than what we currently share.

AIR is a data entry wrapper.

campmlc commented 4 years ago

If I had specific details, we could look into a mirror at UNM. Would a mirror alleviate the periodic "Arctos is down" crashes, or just get sucked into the same problem?

dustymc commented 4 years ago

specific details

Chris can give you that - it'll probably evolve as we move towards production. I think 99% of the cost would be in running it, and would depend on things like how "hot" it is.

Maybe 99% of outages are internal network stuff, which I'd think could be fixed at TACC (but perhaps not for what they charge us).

Once a year or so there's some external network issue, which could only be addressed by a remote install.

CF dies very occasionally, in no small part because we're running an antique version on an antique VM. Clustering should fix that, and we won't need licenses under Lucee.

Oracle basically never goes down - who knows what PG will be like under load but I'd guess even more stable.

I would LOVE to have some sort of off-site component, but we should also carefully consider how much of our limited resources to invest in something like that. Some disk with SSH access to stash backups would be a huge security improvement, and would be the critical component of any more-sophisticated strategy anyway. After that, my preference would be to harden the infrastructure at TACC.