AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

404's on retrieving data archives via bio-cache #272

Open jhpoelen opened 6 years ago

jhpoelen commented 6 years ago

Hi!

I found https://collections.ala.org.au/ws/dataResource/dr3561 via https://collections.ala.org.au/ws/dataResource but got a 404 when retrieving the content at its public archive URL, http://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip . Is this expected?

Related to https://github.com/bio-guoda/preston/issues/1 .

$ wget https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip
--2018-09-08 16:56:42--  https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip
Resolving biocache.ala.org.au (biocache.ala.org.au)... 54.79.49.195, 52.65.238.196
Connecting to biocache.ala.org.au (biocache.ala.org.au)|54.79.49.195|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-09-08 16:56:43 ERROR 404: Not Found.

ansell commented 6 years ago

The underlying reason is that we don't generate archives for data resources that have been registered but do not yet have any records loaded.

It would be ideal if the collections.ala.org.au service didn't generate the URLs in this case to avoid confusion. I have a feeling that the archive URL is generated by code using a pattern without checking if the archive exists or there are records in biocache.ala.org.au for the data resource.
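To illustrate the kind of pattern-based generation suspected here, a minimal sketch follows. The pattern is inferred only from the example URLs in this thread; `archive_url` and `ARCHIVE_BASE` are hypothetical names, not taken from the collections.ala.org.au code.

```python
# Hypothetical helper: rebuilds the archive URL from a data resource uid.
# The pattern is inferred from URLs quoted in this thread, not from ALA source code.
ARCHIVE_BASE = "https://biocache.ala.org.au/archives"


def archive_url(uid: str) -> str:
    """Return the URL an archive would have for this uid, whether or not it exists."""
    return f"{ARCHIVE_BASE}/{uid}/{uid}_ror_dwca.zip"


print(archive_url("dr3561"))
```

Because nothing in such a function consults biocache, it would produce a URL for every registered uid, including those without records.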

jhpoelen commented 6 years ago

@ansell thanks for the clarification. I was wondering whether one could use the "status" fields (e.g., "status" : "identified", see below) as a way to filter out the resources that have been registered, but not yet loaded. If so, which status would indicate a loaded resource?

From https://collections.ala.org.au/ws/dataResource/dr3561 -

{
  "name": "Aboriginal Cultural Heritage (NLP)",
  "acronym": null,
  "uid": "dr3561",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "The Aboriginal Cultural Heritage project increases Aboriginal engagement and participation in sustainable NRM as a part of the Sustainable Communities program.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr3561",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-02-14T11:35:57Z",
  "lastUpdated": "2016-02-14T11:35:57Z",
  "userLastModified": "Data services",
  "provider": {
    "name": "Aboriginal Cultural Heritage (NLP)",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp2243",
    "uid": "dp2243"
  },
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [],
  "hasMappedCollections": false,
  "status": "identified",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

jhpoelen commented 6 years ago

Or perhaps "publicArchiveAvailable": false ?

ansell commented 6 years ago

The best I can recommend at this point is checking the value of the status field, to check if it is set to dataAvailable, rather than identified or something else. I am not very familiar with the collections.ala.org.au API, but looking at some datasets that do have archives, it appears that the status field is the best avenue.

publicArchiveAvailable sounds like it should be the way to go, but it appears to be inconsistent, with it being false on some datasets I have looked at so far which have archives available.

jhpoelen commented 6 years ago

Thanks for sharing your insights.

I poked around and found an example (dr6504) with "status" : "dataAvailable" whose archive URL still returns a 404 (e.g., http://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip ).

In this particular example, the "publicArchiveAvailable" : false .

One idea would be to run Preston on all data resources and measure which archive URLs are active and which are not. Let me know if you'd find this useful.
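As a rough illustration of measuring which archive URLs are active, here is a minimal Python sketch. It is not how Preston works; `probe_status` and `classify_urls` are hypothetical names, and treating only HTTP 200 as "active" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def probe_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the HTTP status code (0 if unreachable)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a missing archive
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, timeout, refused connection


def classify_urls(
    urls: Iterable[str], probe: Callable[[str], int] = probe_status
) -> Dict[str, List[str]]:
    """Split URLs into 'active' (HTTP 200) and 'rotten' (anything else)."""
    result: Dict[str, List[str]] = {"active": [], "rotten": []}
    for url in urls:
        key = "active" if probe(url) == 200 else "rotten"
        result[key].append(url)
    return result
```

Passing the probe as a parameter keeps the classification testable without network access; calling `classify_urls` with the default probe issues one HEAD request per URL.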

{
  "name": "Advisory List of Threatened Invertebrate Fauna in Victoria 2009",
  "acronym": null,
  "uid": "dr6504",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Victorian advisory list for Invertebrate Fauna",
  "techDescription": "This list was first uploaded by Paul Skeen on the Wed Oct 12 01:06:06 UTC 2016.It contains [totalRecords:636, successfulItems:0] taxa.",
  "focus": null,
  "state": null,
  "websiteUrl": "http://lists.ala.org.au/speciesListItem/list/dr6504?max=10",
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr6504",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-10-12T01:06:27Z",
  "lastUpdated": "2016-10-12T01:06:28Z",
  "userLastModified": "Species list upload",
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "species-list",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [
    "species list"
  ],
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

ansell commented 6 years ago

Also, filtering for only "resourceType": "records" could be useful. In this case dr6504 represents a species list. Species lists are located in lists.ala.org.au and are not occurrence records, so we don't currently dump them to Darwin Core Archives. Species lists should be representable using Darwin Core Archives, with a different rowType to occurrence records, so they could be exported in future.
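The two criteria suggested so far (status and resourceType) could be combined roughly as follows. This is only a sketch over already-fetched JSON records; the function names and the `dr-example` uid are hypothetical, and the thread does not establish that the filter is sufficient.

```python
from typing import Any, Dict, Iterable, List


def likely_has_archive(resource: Dict[str, Any]) -> bool:
    """Heuristic from this thread: loaded occurrence datasets only."""
    return (
        resource.get("status") == "dataAvailable"
        and resource.get("resourceType") == "records"
    )


def candidate_uids(resources: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the uids of resources that pass the heuristic."""
    return [r["uid"] for r in resources if likely_has_archive(r)]


# Trimmed-down records; "dr-example" is a made-up uid for illustration.
sample = [
    {"uid": "dr3561", "status": "identified", "resourceType": "records"},
    {"uid": "dr6504", "status": "dataAvailable", "resourceType": "species-list"},
    {"uid": "dr-example", "status": "dataAvailable", "resourceType": "records"},
]
print(candidate_uids(sample))  # ['dr-example']
```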

jhpoelen commented 6 years ago

Using your suggestion, I found a dataResource with "status": "dataAvailable" and "resourceType": "records" that returns a 404 at https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip. Can you reproduce? If so, please let me know if there are additional criteria that can be used. Thanks for your patience.

{
  "name": "AIMS - LTM Nearshore Corals (OBIS Australia)",
  "acronym": null,
  "uid": "dr122",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Surveys of coral species richness were carried out at nearshore reefs of the Great Barrier Reef, Australia in conjunction with surveys of size structure and percentage cover of hard and soft coral communities. Species lists (Presence / Absence) were compiled at 2m and 5m below datum at two sites on 33 reefs between Mackay and Cooktown (latitude 16-23 degrees South) in 2004. The aim of the study was to document the status of nearshore coral communities in this region to serve both as a baseline against which future change could be compared and also identify communities potentially at risk from anthropogenic activities. Hard corals were identified to species level (although on occasion identification was limited to genus) and soft corals were identified to genus.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr122",
  "networkMembership": null,
  "hubMembership": [
    {
      "uid": "dh3",
      "name": "Ocean Biogeographic Information System",
      "uri": "https://collections.ala.org.au/ws/dataHub/dh3"
    }
  ],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2010-09-14T00:05:26Z",
  "lastUpdated": "2011-07-05T04:21:50Z",
  "userLastModified": "David.Martin@csiro.au",
  "provider": {
    "name": "Institute of Marine and Coastal Sciences, Rutgers University",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp18",
    "uid": "dp18"
  },
  "rights": "Acknowledge the use of records from this dataset in the form appearing in the 'Citation' field and acknowledge the use of the OBIS facility. Recognise the limitatons of data in OBIS.",
  "licenseType": null,
  "licenseVersion": null,
  "citation": "AIMS - Status of Nearshore Reefs of the GBR: H Sweatman, A Thompson, S Delean, J Davidson, S Neale",
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": null,
  "contentTypes": [],
  "connectionParameters": {
    "protocol": "DIGIR",
    "resource": "aims_ltm_ns",
    "url": "http://iobis.marine.rutgers.edu/digir2/DiGIR.php",
    "termsForUniqueKey": [
      "institutionCode",
      "collectionCode",
      "catalogNumber"
    ]
  },
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": "Published dataset",
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

ansell commented 6 years ago

That data resource had records in the past (the collections.ala.org.au page shows downloads of records in the past, but none recently https://collections.ala.org.au/public/show/dr122 ), but has had all of its records deleted since then. Because it had records in the past, it likely received the dataAvailable flag at some point and has not had it revoked when the records were deleted.

I have switched its dataAvailable flag back to identified.

There are possibly others in the same category that you can identify using the "Usage statistics" on the public collections HTML page. Otherwise, once those issues are cleared, dataAvailable should be a fairly reliable flag.

More generally, we have not exported data for a few months while we have transitioned to a new version of biocache-store/biocache-service/cassandra/solr, but it is on our todo list to refresh the archive dumps and get them back to being automatically refreshed monthly. You may also find some new datasets that have been created and loaded into the new system for the first time which will not have exported archives yet, but will after we restart the archive creation process.

ansell commented 6 years ago

Just to clarify the underlying cause for dr122 a little further: if datasets have no records, we don't create archives for them. Until recently, the previously generated archives remained available in those cases. However, I went through recently and cleared out old archives that were not being refreshed because of errors or not having any records. When doing that, I didn't change their "status" in collections.ala.org.au, because I was unaware of the field's existence at the time. In the future, if I detect these errors I will add the status change to the todo list for fixing them.

jhpoelen commented 6 years ago

@ansell thanks for clarifying the background. Hoping to get started on integrating the ALA so that the 404s can be easily picked up by Preston and others who might be interested. Does it make sense to leave this issue open until the missing archives are removed / re-generated?

ansell commented 6 years ago

Yes, I have a separate task open for regenerating the files and will do a verification after that process.

jhpoelen commented 5 years ago

Hey @ansell - just checking in on this issue. Did you get a chance to scrub the ALA dataset archives? Would this be a good time to start indexing them?

jhpoelen commented 5 years ago

@ansell was just looking at indexing the ALA bio-cache - is this a good time to start indexing the ala datasets?

ansell commented 5 years ago

Hi,

I tried to regenerate the archives, but my plan was foiled by some software issues that will need a software engineer to look at them.

Have you run your code recently to know where the remaining issues are?

Thanks,

Peter

jhpoelen commented 5 years ago

Just completed a Preston run using newly added ALA support (https://github.com/bio-guoda/preston/issues/1) and found that, out of 1778 active archive URLs (https://collections.ala.org.au/ws/dataResource?status=dataAvailable), 895 were unavailable (or rotten) and 1883 were active. Total data volume ~5 GB. Does this reflect the size of the ALA corpus?

Please see attached lists for more info.

active-urls.txt rotten-urls.txt

ansell commented 5 years ago

The majority of those that I have reviewed are unpublished datasets from our internal data collection systems, DigiVol ( https://volunteer.ala.org.au/ ), and Biocollect ( https://biocollect.ala.org.au/ ). We are storing the metadata for those in collections.ala.org.au, and although they currently say dataAvailable, their records are not currently in biocache.ala.org.au so they will not appear in the dumps (not sure what a better way for that would be).

At some point in the mid-term (1-2 years) we will be switching collections.ala.org.au to use the GBIF registry software so I doubt that changes to collections.ala.org.au to change this behaviour will be made before then.

I will test the export with the latest biocache-store snapshot today to see how it goes, which may add some datasets that are published, but weren't exported yet, but it won't pick up the DigiVol and Biocollect unpublished cases.

ansell commented 5 years ago

Some of them are also lists from https://lists.ala.org.au/ , which may be "private" in some cases, but in most cases you can get those using the Lists API. https://api.ala.org.au/#ws92

ansell commented 5 years ago

You may have more success looking for dumps with this query, which doesn't include species lists:

https://collections.ala.org.au/ws/dataResource?status=dataAvailable&resourceType=records
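For scripted use, the same query can be assembled from its parameters. A small sketch, assuming only the two query parameters shown in the URL above; `data_resource_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "https://collections.ala.org.au/ws/dataResource"


def data_resource_query(status: str = "dataAvailable",
                        resource_type: str = "records") -> str:
    """Build the listing URL filtered by status and resourceType."""
    params = {"status": status, "resourceType": resource_type}
    return f"{BASE}?{urlencode(params)}"


print(data_resource_query())
```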

ansell commented 5 years ago

I have copied all of our latest archive exports and there are some new data resources since last time, so you should see a drop in the number missing out of the 754 data resources that you see with the query above.

ansell commented 5 years ago

Regarding the sizes, we are sending a limited subset of fields to GBIF, so the size doesn't reflect the total data resource sizes:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/export/DwCACreator.scala#L32-L74

jhpoelen commented 5 years ago

Thanks for all the info: very insightful.

re: resource sizes. This makes me wonder: Is there another way to access ALA records that better reflects the ALA corpus (incl. checklists excl. images)?

PS. Nice to see that you are using scala!