NHMDenmark / DanSpecify

Important files regarding the Danish instance of the Specify database system for collections digitisation and management, plus placeholder for issue tracking. Guidelines, manuals and other kinds of documentations will be gathered on the wiki.
3 stars 3 forks source link

Data dump #138

Open PipBrewer opened 2 years ago

PipBrewer commented 2 years ago

Would it be possible to get a data dump (as csv) of all records with all fields from Specify so I can use it to look at how fields are used?

FedorSteeman commented 2 years ago

So I decided it would still be easiest to write a relatively straightforward SQL query for this with the downside that I mentioned, that for each one-to-many relationship, row duplications will occur. The Specify-interface has a built-in way for aggregrating rows, but getting all the details out of the related objects will result in the same thing. Writing a recursive SQL query that flattens the information for each relationship would become too much of a time-intensive effort to qualify as "creating a simple dump".

I started out diligently selecting the fields that actually contain anything interesting at all for the specimen level data in any case, but eventually descended into just selecting everything for each of the related tables. To make it a bit clearer what the different columns stand for, I added separater columns in between each join. I used a Specify query "Full Query All Specimens" to get an idea of what tables to add to the query, which has been copied to both @PipBrewer and @AstridBVW 's accounts. More joins can be added if desired.

I'm not 100% I added everything of interest, but that is also because of the huge amount of data that can potentially be added. For instance, I skipped adding joins of determination agents and collector agents, so only the primary keys of those can be seen in the determination and collecting events respectively. The file itself is already huge (1,2 GB) and can perhaps not be opened in something like Excel. Getting dumps of each individual table may be more what you're after, but then you would need to look up the related info yourself.

In that respect, it is actually good to keep in mind that there is of course a reason that data is structured in a relational database system the way it is. Exercises like these underline that trying to squeeze everything into a flat file is not very useful in itself. What is useful is understanding how to retrieve specific subsets of data from the larger whole, or relying on someone else to do that for you.

In any case this is the SQL script that I used for this purpose: SpFullQuery.sql.txt

The dump itself I place on the N-drive here: N:\SCI-SNM-DigitalCollections\Specify\Specify Simple Dump 20220325.zip