RSS feed for data sets

M-Nicholls commented 8 years ago

requested by iDigBio:

Would it be possible for you to install an RSS feed file with your datasets? A full self-serve PHP script is available: https://github.com/iDigBio/idigbio-feeder

Apart from fetching it from GitHub, all that would be required is to maintain the parameter files (feed.csv and datasets.csv) and to run PHP on your server. If you would prefer not to run the dynamic PHP using the parameter files, you could also make the feed static as a .xml file, and then do the pubDate updates manually.

I know you said 'No' to IPT, but could you be convinced to install the RSS.php file?

Instructions in progress: https://www.idigbio.org/wiki/index.php/CYWG_iDigBio_DwC-A_Pull_Ingestion

Our cyberinfrastructure team is in the midst of developing a sustainable feed mechanism for datasets that do not use IPT. We installed our own feed and also our own IPT, and they are asking for all non-IPT sources to install the RSS feed file.

danstoner commented 8 years ago

Let us (iDigBio) know if you have any questions about the format of the RSS feed. The "Instructions in progress" link in the previous comment ought to have enough samples to template feed generation in the method of your choice. Also, the Requirements section may be helpful:

https://www.idigbio.org/wiki/index.php/CYWG_iDigBio_DwC-A_Pull_Ingestion#Requirements

danstoner commented 8 years ago

Touching base again on the RSS Feed. John La Salle attended the iDigBio Summit and my managers at iDigBio are very interested in hearing that the collaboration has moved forward.

The RSS requirements are fairly simple. When a new data file is generated the RSS feed should be updated with a new publish date.
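
As a rough illustration of that requirement, here is a minimal Python sketch (not iDigBio's or ALA's code) that builds a single RSS item and stamps it with the date the data file was generated. The function name `build_item`, the title, and the example.org link are hypothetical placeholders.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_item(title: str, link: str, published: datetime) -> ET.Element:
    """Build one RSS <item> whose pubDate reflects when the data file was generated."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # RSS 2.0 uses RFC 822 style dates; format_datetime produces that form.
    ET.SubElement(item, "pubDate").text = format_datetime(published)
    return item


# Hypothetical dataset: the title and link are placeholders, not real resources.
item = build_item(
    "Example dataset",
    "http://example.org/archives/example.zip",
    datetime(2016, 6, 26, 5, 21, 27, tzinfo=timezone.utc),
)
print(ET.tostring(item, encoding="unicode"))
```

Regenerating the item with a newer `published` value each time the data file is rebuilt should be enough for a consumer to notice the update.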

nickdos commented 8 years ago

The Collectory is a Java/Grails application and our deploy scripts are not set up to handle PHP, so this is unlikely to happen.

There is an RSS plugin here: https://grails.org/plugin/feeds, so it's a possibility, but a low priority for us right now.

It's unclear to me what the "use case" is for us providing this RSS feed.

danstoner commented 8 years ago

Hi Nick,

Thanks for the response.

I think iDigBio needs to de-emphasize the PHP notion. We provide that as an example for anyone who needs it but it shouldn't be presented as a request or requirement. iDigBio definitely has no preference for PHP as an RSS generator.

I updated iDigBio's RSS feed guidelines just today with some additional examples. These may help you decide what to put in your feed (e.g. title, description, link, ...) if you are able to prioritize this work.

https://www.idigbio.org/wiki/index.php/CYWG_iDigBio_DwC-A_Pull_Ingestion#Requirements

Use case.

Will it be sufficient if I can get John La Salle to say "pretty please" ?

;)

From iDigBio's perspective, RSS seems to be a de facto part of the community "data publishing workflow". IPT, Symbiota, and almost all of the other data publishing tools we encounter have embraced RSS as the mechanism for revealing data file updates to the world. So far, it has only been collections / providers with zero technical resources available who have been unable to provide an RSS feed. I see lots of evidence that ALA does not fall into that category.

I have to believe there is additional utility beyond iDigBio ingestion and that other potential data consumers would make use of an RSS feed. Other data aggregators such as VertNet come to mind as likely beneficiaries.

There probably isn't a practical way for me to contribute to ALA directly to make this happen, but I personally feel that my time would be better invested helping ALA craft an RSS solution than putting that same time into building a "one-off" iDigBio ingestion process to deal with the lack of RSS feed. The one-off solution provides a smaller benefit to the world than a feed provided by ALA itself.

Thanks for your consideration.

nickdos commented 8 years ago

Some notes for implementing this...

EML file example: http://collections.ala.org.au/eml/dr820
identifier: EML has 2 so need to decide which to use (drNNN vs UUID)
data link to DwC archive - will link to biocache dynamic download (zip file)
publication date - dataCurrency date (last harvest & index)

nickdos commented 8 years ago

First attempt deployed today...

http://collections.ala.org.au/feed.xml

danstoner commented 8 years ago

Bug Report:

The "file" parameter seems to be hard-coded to "data-resource-dr968" with the suffix not changing for each item. The "q" parameter changes for each item but the "file" parameter remains the same for all items in the feed.

Expected behavior is that the resource dr349 would have a link to download filename "data-resource-dr349.zip" instead of "data-resource-dr968.zip".

Example: dr349 / CSIRO Ichthyology provider for OZCAM

http://biocache.ala.org.au/ws/occurrences/index/download?sourceTypeId=0&reasonTypeId=9&file=data-resource-dr968&q=data_resource_uid%3Adr349

Note the "file" parameter in the "link" field (file=data-resource-dr968)

The contents of the actual download (the stuff inside the zip archive) change for each item (my computer generates data-resource-dr968.zip.1 data-resource-dr968.zip.2 data-resource-dr968.zip.3 ...), so it is really only the filename that is getting duplicated.
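
For illustration only, a Python sketch of the expected behavior (the Collectory itself is a Grails application, so this is not its code): both the "file" and "q" parameters are derived from the same resource id rather than one of them being fixed. The helper name `download_link` is hypothetical; the parameter names and values are copied from the link above.

```python
from urllib.parse import urlencode

DOWNLOAD_BASE = "http://biocache.ala.org.au/ws/occurrences/index/download"


def download_link(resource_uid: str) -> str:
    """Build a download link whose file name follows the resource id."""
    params = {
        "sourceTypeId": 0,
        "reasonTypeId": 9,
        # Derived per item, so dr349 yields data-resource-dr349.
        "file": f"data-resource-{resource_uid}",
        "q": f"data_resource_uid:{resource_uid}",
    }
    # urlencode percent-encodes the colon in the query as %3A.
    return f"{DOWNLOAD_BASE}?{urlencode(params)}"


print(download_link("dr349"))
```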

nickdos commented 8 years ago

Thanks Dan - missed that value so it was hard coded, sorry. Fixed but not yet deployed.

Edit: now deployed.

danstoner commented 8 years ago

Hi Nick,

I had a chance to download the files from the RSS feed and compare them to the DwC-A that had been shared previously.

It seems that the previous files (files sitting on an HTTP server) were proper darwin core archives. The files linked in the RSS feed come via your Download API and seem to be quite different.

Is it possible for the RSS feed to link to darwin core archives?

# I downloaded the DwC-A files and extracted into subdirectories with distinct suffix.
# For example, http://biocache.ala.org.au/archives/dr340/dr340_ror_dwca.zip I extracted
# to a directory named "dr340"

dstoner@dstoner-thinkster:~/Downloads/ALA$ ls -R
.:
dr340  dr340_ror_dwca.zip  dr349  dr349_ror_dwca.zip  dr367  dr367_ror_dwca.zip  dr376  dr376_ror_dwca.zip  dr742  dr742_ror_dwca.zip  dr90  dr90_ror_dwca.zip  from_feed

./dr340:
eml.xml  meta.xml  occurrence.csv

./dr349:
eml.xml  meta.xml  occurrence.csv

./dr367:
eml.xml  meta.xml  occurrence.csv

./dr376:
eml.xml  meta.xml  occurrence.csv

./dr742:
eml.xml  meta.xml  occurrence.csv

./dr90:
eml.xml  meta.xml  occurrence.csv

# Files linked from the RSS feed at http://collections.ala.org.au/feed.xml: I did the same,
# but in a subdirectory "from_feed".

./from_feed:
data-resource-dr340.zip  data-resource-dr349.zip  data-resource-dr367.zip  data-resource-dr376.zip  data-resource-dr742.zip  data-resource-dr90.zip  dr340  dr349  dr367  dr376  dr742  dr90

./from_feed/dr340:
citation.csv  data-resource-dr340.csv  headings.csv  README.html

./from_feed/dr349:
citation.csv  data-resource-dr349.csv  headings.csv  README.html

./from_feed/dr367:
citation.csv  data-resource-dr367.csv  headings.csv  README.html

./from_feed/dr376:
citation.csv  data-resource-dr376.csv  headings.csv  README.html

./from_feed/dr742:
citation.csv  data-resource-dr742.csv  headings.csv  README.html

./from_feed/dr90:
citation.csv  data-resource-dr90.csv  headings.csv  README.html
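
The difference between the two listings can be checked mechanically. A minimal Python sketch, assuming that a top-level meta.xml is a good enough signal that a zip is a Darwin Core Archive (a heuristic, not a full validation). `looks_like_dwca` and `make_zip` are hypothetical helpers, and the two in-memory zips only mimic the file names listed above.

```python
import io
import zipfile


def looks_like_dwca(zip_source) -> bool:
    """Heuristic: treat a zip as a DwC-A if it has a top-level meta.xml."""
    with zipfile.ZipFile(zip_source) as archive:
        return "meta.xml" in archive.namelist()


def make_zip(file_names) -> io.BytesIO:
    """Build an in-memory zip with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in file_names:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer


static_archive = make_zip(["eml.xml", "meta.xml", "occurrence.csv"])
feed_download = make_zip(
    ["citation.csv", "data-resource-dr340.csv", "headings.csv", "README.html"]
)

print(looks_like_dwca(static_archive))  # True
print(looks_like_dwca(feed_download))   # False
```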

Then I tried to find a data row to compare how it looked from a "feed" data file vs. the static DwC-A.

# Columns and first data row from an RSS feed data file

dstoner@dstoner-thinkster:~/Downloads/ALA$ head -n2 ./from_feed/dr340/data-resource-dr340.csv
"Record ID","Catalog Number","Match Taxon Concept GUID","Scientific Name","Vernacular Name","Matched Scientific Name","Taxon Rank - matched","Vernacular Name - matched","Kingdom - matched","Phylum - matched","Class - matched","Order - matched","Family - matched","Genus - matched","Species - matched","Institution Code","Collection Code","locality","Latitude - original","Longitude - original","geodetic Datum","Latitude - processed","Longitude - processed","Coordinate Uncertainty in Metres - parsed","Country - parsed","State - parsed","Local Government Areas","Collector","Year - parsed","Month - parsed","Event Date - parsed","Basis Of Record - original","Basis Of Record - processed","Sex","Outlier for layer","Taxon identification issue","Occurrence status assumed to be present","Coordinate precision not valid","Coordinates centre of country","Supplied coordinates centre of state","Coordinates dont match supplied country","Country inferred from coordinates","Coordinates derived from verbatim coordinates","Suspected outlier","First of the century","First of the month","First of the year","Geodetic datum assumed WGS84","Habitat incorrect for species","Incomplete collection date","Possible duplicate record","Invalid collection date","Invalid scientific name","Name not in national checklists","Name not recognised","Latitude is negated","Longitude is negated","Resource taxonomic scope mismatch","Outside expert range for species","Coordinates dont match supplied state","Supplied country not recognised","Kingdom not recognised","Type status not recognised","Supplied coordinates are zero","Zero latitude","Zero longitude" 
"ac9cff09-a4b6-42af-999f-a507d04bd369","R.2305","urn:lsid:biodiversity.org.au:afd.taxon:2214e688-6892-4e88-a23f-2a598d256b58","Hemiaspis signata (Jan, 1859)","Black-bellied Swamp Snake","Hemiaspis signata","species","Black-bellied Swamp Snake","ANIMALIA","CHORDATA","REPTILIA","SQUAMATA","ELAPIDAE","Hemiaspis","Hemiaspis signata","AM","Herpetology","","-28.433","153.466","","-28.433","153.466","10000.0","Australia","New South Wales","Tweed (A)","McCooey, H. J.","","","","PreservedSpecimen","PreservedSpecimen","","","noIssue","true","false","false","false","false","false","false","false","false","false","false","true","false","false","false","false","false","false","false","false","false","false","false","false","false","false","false","false","false","false" 

# Data row matching the same uuid (e.g. the "same" row as above but from the previously generated static DwC-A)

dstoner@dstoner-thinkster:~/Downloads/ALA$ grep ac9cff09-a4b6-42af-999f-a507d04bd369 dr340/occurrence.csv 
"ac9cff09-a4b6-42af-999f-a507d04bd369","R.2305","Herpetology","AM","Hemiaspis signata (Jan, 1859)","McCooey, H. J.","","Animalia","Chordata","","","Elapidae","Hemiaspis","signata","","-28.433","153.466","0.001","10000","","","","","","Australia","New South Wales","","","","","","PreservedSpecimen","","","","ecatalogue.LocCollectionEventLocal: ""Australia, New South Wales, Burringbar (28° 26' S, 153° 28' E), McCooey, H. J.(Collector), Accession"";","","Black-bellied Swamp Snake","","","urn:australianmuseum.net.au:Events:2005411","","","",""

nickdos commented 8 years ago

Hi Dan, I did wonder about that when I was implementing it. We don't have DwC-A for many of the datasets IIRC, so the "raw" download format has much better coverage (everything). I think I'll need to work out how to get the subset of datasets we have DwC archives for and then restrict the feed to that list. I'll talk to @M-Nicholls about whether this is right, etc.

danstoner commented 8 years ago

@nickdos, @M-Nicholls - Just a ping to see if there is any status change.

nickdos commented 8 years ago

Thanks for the reminder @danstoner. I've put it back in the ToDo pipeline. We could provide a proper DwC CSV for all the data resources or limit the feed to a subset of data resources where we have a raw (unprocessed) DwC archive file. Which is better for you?

danstoner commented 8 years ago

We (iDigBio) are only going to be interested in Occurrence data (not observations) due to our mandate, so we definitely will only be pulling in a subset of what you aggregate in total.

Within the Occurrence data, we try to only take datasets with records identified by robust / globally unique identifiers as these are more useful in the aggregate. So, datasets with incrementing integer occurrenceID fields (1, 2, 3, ...) are less attractive than UUID or urn:catalog style identifiers.
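
A rough Python sketch of how such identifiers might be told apart when inspecting a dataset. This is an assumed heuristic for illustration, not iDigBio's actual selection logic. `classify_identifier` is a hypothetical helper, and the urn:catalog value is a made-up example assembled from codes that appear elsewhere in this thread.

```python
import uuid


def classify_identifier(value: str) -> str:
    """Classify an identifier as 'integer', 'urn', 'uuid', or 'other'."""
    value = value.strip()
    if value.isdigit():
        return "integer"  # e.g. 1, 2, 3: unique only within one dataset
    if value.lower().startswith("urn:"):
        return "urn"
    try:
        uuid.UUID(value)  # raises ValueError for anything that is not a UUID
        return "uuid"
    except ValueError:
        return "other"


samples = [
    "1",
    "ac9cff09-a4b6-42af-999f-a507d04bd369",
    "urn:catalog:AM:Herpetology:R.2305",  # hypothetical urn:catalog identifier
    "R.2305",
]
for sample in samples:
    print(sample, "->", classify_identifier(sample))
```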

We generally look at a feed, inspect each resource individually, and make a decision on each one that meets the above criteria.

We would enjoy a feed with only Occurrence datasets listed, but I don't want that preference to hinder your progress in any way.

To be very specific, the last time we looked, the following are the dataset files that seemed ready for uptake by iDigBio:

http://biocache.ala.org.au/archives/dr340/dr340_ror_dwca.zip
http://biocache.ala.org.au/archives/dr742/dr742_ror_dwca.zip
http://biocache.ala.org.au/archives/dr130/dr130_ror_dwca.zip
http://biocache.ala.org.au/archives/dr367/dr367_ror_dwca.zip
http://biocache.ala.org.au/archives/dr349/dr349_ror_dwca.zip
http://biocache.ala.org.au/archives/dr90/dr90_ror_dwca.zip
http://biocache.ala.org.au/archives/dr376/dr376_ror_dwca.zip

danstoner commented 8 years ago

@nickdos, @M-Nicholls - Hope you are well. Here is my semi-monthly ping to inquire if there has been any activity on this issue.

danstoner commented 7 years ago

Noting that the RSS feed at http://collections.ala.org.au/feed.xml is still linking to files that are not DwC-A. Just updating the ticket since iDigBio leadership asked about ALA recently.

http://biocache.ala.org.au/ws/occurrences/index/download?sourceTypeId=0&reasonTypeId=9&file=data-resource-dr340&q=data_resource_uid%3Adr340

$ head -n2 data-resource-dr340.csv 
"Record ID","Catalog Number","Match Taxon Concept GUID","Scientific Name","Vernacular Name","Matched Scientific Name","Taxon Rank - matched","Vernacular Name - matched","Kingdom - matched","Phylum - matched","Class - matched","Order - matched","Family - matched","Genus - matched","Species - matched","Subspecies - matched","Institution Code","Collection Code","locality","Latitude - original","Longitude - original","geodetic Datum","Latitude - processed","Longitude - processed","Coordinate Uncertainty in Metres - parsed","Country - parsed","IBRA 7 Regions","IMCRA 4 Regions","State - parsed","Local Government Areas","Minimum Elevation In Metres","Maximum Elevation In Metres","Minimum Depth In Meters","Maximum Depth In Meters","Collector","Year - parsed","Month - parsed","Event Date - parsed","Basis Of Record - original","Basis Of Record - processed","Sex","Outlier for layer","Taxon identification issue","Location Quality","Occurrence status assumed to be present","Coordinate precision not valid","Coordinates centre of country","Supplied coordinates centre of state","Coordinates dont match supplied country","Country inferred from coordinates","Coordinates derived from verbatim coordinates","Suspected outlier","First of the century","First of the month","First of the year","Geodetic datum assumed WGS84","Habitat incorrect for species","Incomplete collection date","Possible duplicate record","Invalid collection date","Invalid scientific name","Name not in national checklists","Name not recognised","Latitude is negated","Longitude is negated","Resource taxonomic scope mismatch","Outside expert range for species","Coordinates dont match supplied state","Supplied country not recognised","Kingdom not recognised","Type status not recognised","Supplied coordinates are zero","Zero latitude","Zero longitude"
"56a6ffa2-f6a8-4d40-89a5-430aaff9e5eb","C.33877","","Turris babylonia (Linnaeus, 1758)","","","","","","","","","","","","","AM","Malacology","","-4.200","152.183","","-4.2","152.183","10000.0","Papua New Guinea","","","","","","","","","locals","1975","","","PreservedSpecimen","PreservedSpecimen","","","noIssue","true","true","false","false","false","false","false","false","false","false","false","false","true","false","true","false","false","false","true","false","false","false","true","false","false","true","false","false","false","false","false"

nickdos commented 7 years ago

Hi Dan, sorry about the lack of progress on this. We're running 2 developers down on our normal team size and have had a big project take all our time for the last 6 months. We currently don't produce DwC-A format, just CSV with DwC named headers (which is new since the last update; they were DwC-like last time). So if you guys must have DwC-A then this issue will be waiting for a while longer. However, if regular DwC/CSV is OK and our format is slightly wrong, then we could make minor changes to accommodate that. For explanation: we previously did supply some DwC-A files, but only for a very small number of data resources where the data came to us already in DwC-A format.

danstoner commented 7 years ago

Hi Nick, thanks for the update.

Could you please explain the data flow / update cycle of the following datasets?

http://biocache.ala.org.au/archives/dr340/dr340_ror_dwca.zip
http://biocache.ala.org.au/archives/dr742/dr742_ror_dwca.zip
http://biocache.ala.org.au/archives/dr130/dr130_ror_dwca.zip
http://biocache.ala.org.au/archives/dr367/dr367_ror_dwca.zip
http://biocache.ala.org.au/archives/dr349/dr349_ror_dwca.zip
http://biocache.ala.org.au/archives/dr90/dr90_ror_dwca.zip
http://biocache.ala.org.au/archives/dr376/dr376_ror_dwca.zip

I think these are the files that are consumed by GBIF (based on dataset information in GBIF portal last time I checked) and generally iDigBio has good luck ingesting files that GBIF is already able to consume.

Thanks!

nickdos commented 7 years ago

I think my last sentence above covers these data, i.e. the data was provided to us in this format already, but I'll double check that, as I could be totally wrong.

M-Nicholls commented 7 years ago

Actually, those archives for GBIF are produced by a scheduled job that runs around the 15th of each month and produces a DwC-A of every data set. It's not working properly at the moment due to moving servers and the scheduler needing permission to copy the files to the new server but we're working on fixing that now.

danstoner commented 7 years ago

That's good news.

The ideal scenario for us then would be just a minor modification to the RSS feed.

For example, looking at the dr340 resource:

<item>
    <title>Australian Museum provider for OZCAM</title>
    <link>http://biocache.ala.org.au/ws/occurrences/index/download?sourceTypeId=0&amp;reasonTypeId=9&amp;file=data-resource-dr340&amp;q=data_resource_uid%3Adr340</link>
    <pubDate>Sun, 26 Jun 2016 05:21:27 +1000</pubDate>
    <description><![CDATA[ Australian Museum provider for OZCAM ]]></description>
    <guid isPermaLink="false">http://collections.ala.org.au/public/showDataResource/dr340</guid>
    <ipt:eml>http://collections.ala.org.au/eml/dr340</ipt:eml>
</item>

If the <link> field contained:

http://biocache.ala.org.au/archives/dr340/dr340_ror_dwca.zip

instead of

http://biocache.ala.org.au/ws/occurrences/index/download?sourceTypeId=0&amp;reasonTypeId=9&amp;file=data-resource-dr340&amp;q=data_resource_uid%3Adr340

I think it would be a great situation.
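
To make the suggested change concrete, here is a small Python sketch that rewrites an item's link to the archive location. It assumes the resource id is the last path segment of the guid and that every listed resource has an archive at the `<id>/<id>_ror_dwca.zip` path seen in the URLs above, which may not hold for all data resources. `point_item_at_archive` is a hypothetical helper, and the sample item omits the ipt:eml element so it parses without a namespace declaration.

```python
import xml.etree.ElementTree as ET

# Assumed pattern, inferred from the archive URLs quoted in this thread.
ARCHIVE_PATTERN = "http://biocache.ala.org.au/archives/{uid}/{uid}_ror_dwca.zip"

SAMPLE_ITEM = """<item>
    <title>Australian Museum provider for OZCAM</title>
    <link>http://biocache.ala.org.au/ws/occurrences/index/download?sourceTypeId=0&amp;reasonTypeId=9&amp;file=data-resource-dr340&amp;q=data_resource_uid%3Adr340</link>
    <guid isPermaLink="false">http://collections.ala.org.au/public/showDataResource/dr340</guid>
</item>"""


def point_item_at_archive(item: ET.Element) -> None:
    """Replace the item's <link> with the static DwC-A archive URL."""
    # The resource id (e.g. dr340) is taken from the end of the guid.
    uid = item.findtext("guid").rstrip("/").rsplit("/", 1)[-1]
    item.find("link").text = ARCHIVE_PATTERN.format(uid=uid)


item = ET.fromstring(SAMPLE_ITEM)
point_item_at_archive(item)
print(item.findtext("link"))
```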

danstoner commented 7 years ago

I still see value in ALA providing its own RSS feed for us and other potential consumers of ALA data, but iDigBio specifically does not require that feature any longer.

As far as I am concerned it is ok if you wish to close this github issue.

Thank you for your time on this issue!

AtlasOfLivingAustralia / collectory-plugin

RSS feed for data sets #54