jhpoelen opened 6 years ago
The underlying reason is that we don't generate archives for data resources that have been registered but do not yet have any records loaded.
It would be ideal if the collections.ala.org.au service didn't generate the URLs in this case to avoid confusion. I have a feeling that the archive URL is generated by code using a pattern without checking if the archive exists or there are records in biocache.ala.org.au for the data resource.
@ansell thanks for the clarification. I was wondering whether one could use the "status" field (e.g., "status": "identified", see below) as a way to filter out the resources that have been registered but not yet loaded. If so, which status would indicate a loaded resource?
From https://collections.ala.org.au/ws/dataResource/dr3561 -
{
  "name": "Aboriginal Cultural Heritage (NLP)",
  "acronym": null,
  "uid": "dr3561",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "The Aboriginal Cultural Heritage project increases Aboriginal engagement and participation in sustainable NRM as a part of the Sustainable Communities program.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr3561",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-02-14T11:35:57Z",
  "lastUpdated": "2016-02-14T11:35:57Z",
  "userLastModified": "Data services",
  "provider": {
    "name": "Aboriginal Cultural Heritage (NLP)",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp2243",
    "uid": "dp2243"
  },
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [],
  "hasMappedCollections": false,
  "status": "identified",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}
Or perhaps "publicArchiveAvailable": false?
The best I can recommend at this point is checking the value of the status field, to see whether it is set to dataAvailable rather than identified or something else. I am not very familiar with the collections.ala.org.au API, but looking at some datasets that do have archives, the status field appears to be the best avenue. publicArchiveAvailable sounds like it should be the way to go, but it appears to be inconsistent: it is false on some datasets I have looked at so far which do have archives available.
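The heuristic above can be sketched as a small predicate over the JSON documents quoted in this thread. This is only an illustration, not part of any ALA client library, and the function name is made up:

```python
def looks_loaded(resource: dict) -> bool:
    """Heuristic from this thread: treat a registry entry as having
    loaded records only when its status is "dataAvailable".
    "identified" (and other values) mean registered but not loaded.
    Note: publicArchiveAvailable is reportedly inconsistent, so it is
    deliberately not consulted here."""
    return resource.get("status") == "dataAvailable"
```

For example, the dr3561 document above (status "identified") would be filtered out.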
Thanks for sharing your insights.
I poked around and found an example (dr6504) with "status": "dataAvailable", but its archive URL returns a 404 (e.g., http://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip). In this particular example, "publicArchiveAvailable": false.
One idea would be to run Preston on all data resources and measure which archive URLs are active and which are not. Let me know if you'd find this useful.
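A minimal sketch of such a liveness check, using only the Python standard library. The archive URL pattern is inferred from the examples quoted in this thread and is not an official API contract; the function names are hypothetical:

```python
import urllib.error
import urllib.request


def archive_url(uid: str) -> str:
    # URL pattern observed in the publicArchiveUrl/gbifArchiveUrl fields
    # of this thread's examples; assumed, not documented.
    return f"https://biocache.ala.org.au/archives/{uid}/{uid}_ror_dwca.zip"


def archive_is_active(uid: str, timeout: float = 10.0) -> bool:
    """HEAD-request the archive URL and report whether it resolves.

    Returns False on HTTP errors such as the 404s discussed above."""
    req = urllib.request.Request(archive_url(uid), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        return False
```

A HEAD request avoids downloading the (potentially large) zip just to learn whether it exists.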
{
  "name": "Advisory List of Threatened Invertebrate Fauna in Victoria 2009",
  "acronym": null,
  "uid": "dr6504",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Victorian advisory list for Invertebrate Fauna",
  "techDescription": "This list was first uploaded by Paul Skeen on the Wed Oct 12 01:06:06 UTC 2016.It contains [totalRecords:636, successfulItems:0] taxa.",
  "focus": null,
  "state": null,
  "websiteUrl": "http://lists.ala.org.au/speciesListItem/list/dr6504?max=10",
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr6504",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-10-12T01:06:27Z",
  "lastUpdated": "2016-10-12T01:06:28Z",
  "userLastModified": "Species list upload",
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "species-list",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [
    "species list"
  ],
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}
Also, filtering for only "resourceType": "records" could be useful. In this case dr6504 represents a species-list; species lists are hosted on lists.ala.org.au and are not occurrence records, so we don't currently dump them to Darwin Core Archives. Species lists should be representable using Darwin Core Archives, with a different rowType from occurrence records, so they could be exported in future.
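Combining the two criteria from this thread (status and resourceType) gives a sharper filter. Again a sketch only; the function name is invented for illustration:

```python
def looks_exportable(resource: dict) -> bool:
    # Per this thread: only occurrence-record resources marked
    # dataAvailable are expected to have a Darwin Core Archive dump.
    # Species lists (resourceType "species-list") live in
    # lists.ala.org.au and are not currently exported as archives.
    return (resource.get("status") == "dataAvailable"
            and resource.get("resourceType") == "records")
```

Under this filter the dr6504 document above is excluded (it is a species-list), even though its status is dataAvailable.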
Using your suggestion, I found a dataResource with "status": "dataAvailable" and "resourceType": "records" that returns a 404 at https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip. Can you reproduce this? If so, please let me know if there are additional criteria that can be used.
Thanks for your patience.
{
  "name": "AIMS - LTM Nearshore Corals (OBIS Australia)",
  "acronym": null,
  "uid": "dr122",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Surveys of coral species richness were carried out at nearshore reefs of the Great Barrier Reef, Australia in conjunction with surveys of size structure and percentage cover of hard and soft coral communities. Species lists (Presence / Absence) were compiled at 2m and 5m below datum at two sites on 33 reefs between Mackay and Cooktown (latitude 16-23 degrees South) in 2004. The aim of the study was to document the status of nearshore coral communities in this region to serve both as a baseline against which future change could be compared and also identify communities potentially at risk from anthropogenic activities. Hard corals were identified to species level (although on occasion identification was limited to genus) and soft corals were identified to genus.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr122",
  "networkMembership": null,
  "hubMembership": [
    {
      "uid": "dh3",
      "name": "Ocean Biogeographic Information System",
      "uri": "https://collections.ala.org.au/ws/dataHub/dh3"
    }
  ],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2010-09-14T00:05:26Z",
  "lastUpdated": "2011-07-05T04:21:50Z",
  "userLastModified": "David.Martin@csiro.au",
  "provider": {
    "name": "Institute of Marine and Coastal Sciences, Rutgers University",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp18",
    "uid": "dp18"
  },
  "rights": "Acknowledge the use of records from this dataset in the form appearing in the 'Citation' field and acknowledge the use of the OBIS facility. Recognise the limitatons of data in OBIS.",
  "licenseType": null,
  "licenseVersion": null,
  "citation": "AIMS - Status of Nearshore Reefs of the GBR: H Sweatman, A Thompson, S Delean, J Davidson, S Neale",
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": null,
  "contentTypes": [],
  "connectionParameters": {
    "protocol": "DIGIR",
    "resource": "aims_ltm_ns",
    "url": "http://iobis.marine.rutgers.edu/digir2/DiGIR.php",
    "termsForUniqueKey": [
      "institutionCode",
      "collectionCode",
      "catalogNumber"
    ]
  },
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": "Published dataset",
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}
That data resource had records in the past (the collections.ala.org.au page shows downloads of records in the past, but none recently: https://collections.ala.org.au/public/show/dr122), but has had all of its records deleted since then. Because it had records in the past, it likely received the dataAvailable flag at some point and has not had it revoked when the records were deleted. I have switched its dataAvailable flag back to identified.
There are possibly others in the same category that you can identify using the "Usage statistics" on the public collections HTML page. Otherwise, once those issues are cleared, dataAvailable should be a fairly reliable flag.
More generally, we have not exported data for a few months while we have transitioned to a new version of biocache-store/biocache-service/cassandra/solr, but it is on our todo list to refresh the archive dumps and get them back to being automatically refreshed monthly. You may also find some new datasets that have been created and loaded into the new system for the first time which will not have exported archives yet, but will after we restart the archive creation process.
Just to clarify the underlying cause for dr122 a little further: if datasets have no records, we don't create archives for them. In those cases the previous archives have, in the past, remained available. However, I went through recently and cleared out old archives that were not being refreshed, either because of errors or because they no longer had any records. When doing that, I didn't change their "status" in collections.ala.org.au, because I was unaware of that field at the time. In future, if I detect these errors, I will add the status change to the todo list for fixing them.
@ansell thanks for clarifying the background. I'm hoping to get started on integrating the ALA so that the 404s can be easily picked up by Preston and others who might be interested. Does it make sense to leave this issue open until the missing archives are removed / re-generated?
Yes, I have a separate task open for regenerating the files and will do a verification after that process.
hey @ansell - just checking in on this issue. Did you get a chance to scrub the ALA dataset archives? Would this be a good time to start indexing them?
@ansell was just looking at indexing the ALA biocache - is this a good time to start indexing the ALA datasets?
Hi,
I tried to regenerate the archives, but my plan was foiled by some software issues that will need a software engineer to look at them.
Have you run your code recently to know where the remaining issues are?
Thanks,
Peter
Just completed a Preston run using the newly added ALA support (https://github.com/bio-guoda/preston/issues/1) and found that, out of 1778 active archive URLs (https://collections.ala.org.au/ws/dataResource?status=dataAvailable), 895 were unavailable (or rotten) and 1883 were active. Total data volume ~5 GB. Does this reflect the size of the ALA corpus?
Please see attached lists for more info.
The majority of those that I have reviewed are unpublished datasets from our internal data collection systems, DigiVol (https://volunteer.ala.org.au/) and Biocollect (https://biocollect.ala.org.au/). We store the metadata for those in collections.ala.org.au, and although they currently say dataAvailable, their records are not currently in biocache.ala.org.au, so they will not appear in the dumps (I'm not sure what a better way to handle that would be).
At some point in the mid-term (1-2 years) we will be switching collections.ala.org.au to use the GBIF registry software, so I doubt this behaviour will be changed before then.
I will test the export with the latest biocache-store snapshot today to see how it goes, which may add some datasets that are published, but weren't exported yet, but it won't pick up the DigiVol and Biocollect unpublished cases.
Some of them are also lists from https://lists.ala.org.au/, which may be "private" in some cases, but in most cases you can retrieve them using the Lists API: https://api.ala.org.au/#ws92
You may have more success looking for dumps with this query, which doesn't include species lists:
https://collections.ala.org.au/ws/dataResource?status=dataAvailable&resourceType=records
I have copied all of our latest archive exports and there are some new data resources since last time, so you should see a drop in the number missing out of the 754 data resources that you see with the query above.
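Enumerating the resources returned by that filtered query could be sketched as follows, again with the standard library only. The assumption that each entry in the list response carries a "uid" field mirrors the provider sub-objects quoted earlier in this thread, but the exact list-response shape is not confirmed here, and the function names are invented:

```python
import json
import urllib.request

# The filtered registry query suggested above (records only, dataAvailable).
QUERY = ("https://collections.ala.org.au/ws/dataResource"
         "?status=dataAvailable&resourceType=records")


def uids(listing: list) -> list:
    # Pull the uid (e.g. "dr122") out of each registry entry,
    # skipping entries without one. Shape of entries is assumed.
    return [entry["uid"] for entry in listing if "uid" in entry]


def fetch_uids(timeout: float = 30.0) -> list:
    """Fetch the filtered data-resource listing and return its uids."""
    with urllib.request.urlopen(QUERY, timeout=timeout) as resp:
        return uids(json.load(resp))
```

Each returned uid could then be fed to an archive-URL liveness check to count the missing dumps.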
Regarding the sizes: we are sending a limited subset of fields to GBIF, so the archive sizes don't reflect the total data resource sizes.
Thanks for all the info: very insightful.
re: resource sizes. This makes me wonder: is there another way to access ALA records that better reflects the ALA corpus (incl. checklists, excl. images)?
PS. Nice to see that you are using scala!
Hi!
I found https://collections.ala.org.au/ws/dataResource/dr3561 via https://collections.ala.org.au/ws/dataResource but got a 404 when retrieving the content associated with the public archive URL http://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip. Is this expected?
Related to https://github.com/bio-guoda/preston/issues/1 .