VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Tulane DwCA Testing #107

Closed laurarussell closed 10 years ago

laurarussell commented 10 years ago

Need to work with @eightysteele or @robinkraft to harvest the Tulane non-IPT generated DwCA file. I need to detail any issues we encounter when trying to harvest so that we can report them to Tulane, they can fix them, and we can begin harvesting their data. This will be a test case for any other non-IPT generated DwCA files we are likely to get from institutions like OZCAM or BIOCASE providers.

The Tulane DwCA file is located at:

http://data.tubri.org/datasets/dwca-tufish.zip

eightysteele commented 10 years ago

This will be tricky since we depend on IPT for a bunch of things in the harvest process. Easy for them or us to publish via IPT?

robgur commented 10 years ago

Hey, @eightysteele, @robinkraft, @laurarussell --- don't think we can enforce IPT publishing as a prereq. for joining VertNet, especially since we said it wasn't a prereq. to everyone and their brother here: http://blog.vertnet.org/post/10209478183/vertnet-and-gbif-ipt (see "going forward..." section). I do think DwC-As are a prereq. The question is how to handle non-IPT DwC-As and I am open to suggestions, @eightysteele

eightysteele commented 10 years ago

Absolutely. Can handle, just not in the next two weeks. Will need to prioritize and schedule in contract scope. Good?

robgur commented 10 years ago

Maybe we can emulate an IPT installation for the providers who give us a DwC-A? Make sure we capture the exact same EML metadata and migration process? Is there a quick solution that might not be automated but that can push us forward?

eightysteele commented 10 years ago

Laura, easy to publish these via our ipt installation?

robgur commented 10 years ago

I don't think we can republish those, Aaron, because the providers have explicitly decided on an alternate mechanism and we don't want to trump that, right? I think we can ask publishers for a small set of information needed to basically "push" those into VN in a way that looks like what we might get from IPT. It's a subtle but important distinction (I think? It's a better question for Laura, Dave, John).

eightysteele commented 10 years ago

Rob, good call. Lemme take a closer look at the ipt deps...

tucotuco commented 10 years ago

We can have resources on the VertNet IPT that are private and still use those for harvesting (the archive is still accessible from the URL even if the visibility is set to private). Why not just create resources for the orphans on VertNet and harvest them like all the rest?

eightysteele commented 10 years ago

Boom, +1 for private resources. Rob, good?

robgur commented 10 years ago

Yes, but a worry --- we are undertaking a "publishing act", even if private, and I think we should ask permission of the resource provider before doing it --- a matter of trust, I guess? If we are going to republish via IPT anyway, why not ask them if this is something they want public?

laurarussell commented 10 years ago

A private resource URL is only available when a registered user with permission to the resource is logged into the IPT. See http://ipt.vertnet.org:8080/ipt/archive.do?r=flmnh_birds (try this while you are not logged into IPT). Can the harvester log in to IPT to grab the files?

But, I don't see this as an orphan situation. These institutions are making their DwC-A files publicly available on their websites. See: http://data.tubri.org/ or http://collections.ala.org.au/public/showDataResource/dr340.

From what I've seen on Tulane and OZCAM resources, the metadata in the EML file is poor, but in the case of OZCAM the metadata on their website pages is rich. I don't want to be responsible for recreating IPT-required (not DwC-required) metadata and mappings (basisOfRecord) and required contacts as a workaround.

From where is the metadata pulled onto our portal pages? Is this coming from the EML files or is it coming from the HTML rendered page? Do we redirect the user back to the original source file from the URL contained in the resource_staging table? If so, that URL is the pointer to the place to harvest the files, in which case we'd have to add more fields to redirect back to the source site on non-IPT created archives.

I guess I don't understand what about the process is so reliant on it coming from IPT? Shouldn't we be able to pull and load what is contained within the DwCA files no matter from where they come?
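
For reference, the layout of a Darwin Core Archive is described by the `meta.xml` descriptor inside the zip, whichever tool produced it. Below is a minimal Python sketch of reading that descriptor, assuming `meta.xml` sits at the top level of the zip and contains a single `<core>` element with one file location. The function name `read_core_layout` is hypothetical, and this is not how gulo itself reads archives.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by Darwin Core Archive descriptors (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def read_core_layout(archive_bytes):
    """Return (core_file_name, {column_index: term_uri}) for a DwC-A zip.

    Assumes meta.xml is at the top level of the archive.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        root = ET.fromstring(zf.read("meta.xml"))

    core = root.find(DWC_TEXT_NS + "core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")

    location = core.find(DWC_TEXT_NS + "files/" + DWC_TEXT_NS + "location")
    if location is None or not (location.text or "").strip():
        raise ValueError("<core> does not name a data file")

    # Map each column index to the term it holds. Fields without an
    # index carry only a default value, so they are skipped here.
    columns = {}
    for field in core.findall(DWC_TEXT_NS + "field"):
        index = field.get("index")
        if index is not None:
            columns[int(index)] = field.get("term")

    return location.text.strip(), columns
```

Nothing in this step depends on where the archive was generated; whatever the harvest needs beyond the descriptor and the data file is outside this sketch.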

eightysteele commented 10 years ago

We're just looking for a short term fix for processing archives. Surfacing them in IPT is a short termer. The long term solution will come in the next weeks.

laurarussell commented 10 years ago

Well, we don't yet have permission to use OZCAM stuff (been working on it) and Tulane hasn't complained yet about not being in the portal. So I hate to develop a workaround if a solution is coming. I know (or think I know) that the Tulane file has issues. This is why I wanted you to try to harvest it as a stand-alone test case so we can pinpoint problems and potential problems for non-IPT generated DwCAs and get them straightened out. Me importing it into and publishing via IPT doesn't resolve issues with their DwCA file and what might break the harvester.

So I get why this may not be a priority in the next week and a half, but it needs to be on the list going forward.
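
A stand-alone test like this could start with a structural check of the archive before any harvest is attempted. The Python sketch below is only an illustration: the function name `list_archive_problems` is hypothetical, and the set of checks (declared metadata file, core data file present, basisOfRecord mapped) is an assumption drawn from the concerns raised in this thread, not a list of what the gulo harvester actually requires.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"
BASIS_OF_RECORD = "http://rs.tdwg.org/dwc/terms/basisOfRecord"


def list_archive_problems(archive_bytes):
    """Return human-readable problems found in a DwC-A zip (empty if none)."""
    try:
        zf = zipfile.ZipFile(io.BytesIO(archive_bytes))
    except zipfile.BadZipFile:
        return ["not a readable zip file"]

    problems = []
    with zf:
        names = set(zf.namelist())
        if "meta.xml" not in names:
            return ["no meta.xml descriptor at the top level"]
        try:
            root = ET.fromstring(zf.read("meta.xml"))
        except ET.ParseError:
            return ["meta.xml is not well-formed XML"]

    # The metadata (EML) file is named by an attribute on <archive>.
    metadata_name = root.get("metadata")
    if metadata_name is None:
        problems.append("meta.xml does not declare a metadata file")
    elif metadata_name not in names:
        problems.append("metadata file %r is missing" % metadata_name)

    core = root.find(DWC_TEXT_NS + "core")
    if core is None:
        problems.append("meta.xml has no <core> element")
        return problems

    location = core.find(DWC_TEXT_NS + "files/" + DWC_TEXT_NS + "location")
    data_name = (location.text or "").strip() if location is not None else ""
    if not data_name:
        problems.append("<core> does not name a data file")
    elif data_name not in names:
        problems.append("core data file %r is missing" % data_name)

    terms = {f.get("term") for f in core.findall(DWC_TEXT_NS + "field")}
    if BASIS_OF_RECORD not in terms:
        problems.append("basisOfRecord is not mapped in <core>")

    return problems
```

The returned list is plain text so it can be passed back to the provider as is. Row-level problems (bad delimiters, encoding, malformed values) would need a separate pass over the data file.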

tucotuco commented 10 years ago

Decided that VN will always harvest from an IPT, even if we have to create an unregistered resource to do so. This has been implemented for Tulane, where we put the data after running it through a migrator.