EOL / tramea

A lightweight server for denormalized EOL data

BHL resource fails to assign images to multiple taxa of the same rank #215

Closed jhammock closed 8 years ago

jhammock commented 8 years ago

This resource http://eol.org/content_partners/134/resources/544 behaves much like the flickr resource, and relies on taxonomic tags. Multiple tags are often used to provide higher taxonomy for a single taxon; this is common in the main flickr group, but rare in this one. Multiple tags may also be used to identify multiple taxa in one image: http://eol.org/data_objects/32778209. (very common in this resource)

In the main flickr group, two things happened, and I hope they happened in the connector, so that we have the option of setting them differently in this one:

  1. The image was not attached to all taxa listed in the tags, but it was determined which was the target and the image was attached to that taxon.
  2. The remaining tags were used to construct the higher taxonomy.

For the BHL flickr group, this behavior works out badly. I think default behavior, and universal behavior if that's the only way, should be that the image is separately attached to all taxa in its tags.
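To make the two behaviors concrete, here is a minimal sketch, assuming the tags have already been parsed into (rank, name) pairs. The function names, the choice of the first binomial as the "target", and the rule for when to share ancestry are all assumptions for illustration; the thread does not say how the connector actually picks its target.

```python
def split_tags(tags):
    """Separate species-level names from higher-rank tags."""
    binomials = [name for rank, name in tags if rank == "binomial"]
    ancestry = {rank: name for rank, name in tags if rank != "binomial"}
    return binomials, ancestry


def attach_single_target(tags):
    """Main Flickr group behavior as described: one target taxon,
    remaining tags become its higher taxonomy.
    Assumption: the first binomial is treated as the target."""
    binomials, ancestry = split_tags(tags)
    if not binomials:
        return []
    return [{"binomial": binomials[0], "ancestry": ancestry}]


def attach_to_all(tags):
    """Requested BHL behavior: every tagged binomial gets its own copy
    of the image.
    Assumption: higher-rank tags are only shared when there is a single
    binomial, since they may not apply to every organism on a plate."""
    binomials, ancestry = split_tags(tags)
    shared = ancestry if len(binomials) == 1 else {}
    return [{"binomial": name, "ancestry": dict(shared)} for name in binomials]
```

With three binomial tags on one image, `attach_single_target` yields one taxon and `attach_to_all` yields three.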

I'm not sure what possible behaviors exist now, because some images successfully attach to multiple species from their tags: http://eol.org/data_objects/32778209

And others do not: https://flic.kr/p/dQZ84N I'm out of my depth here, because in the examples of this that I could find, the image appears on either all, or none of the taxa with which it is tagged. I can't explain this, so you probably want to sanity check and correct my explanation of this part...

Finally, some users in the BHL flickr group list several names in order to recognize them as synonyms. This is a problem because, if we include all of those names, (whether or not they get their image attached) EOL will no longer merge them as synonyms. Possibly nothing can be done about this, but if heroic measures are possible (running the taxa file of the resource against a global names service to detect synonyms, and add the synonym relationships?) that would be nice too.

Don't want much, do I?

eliagbayani commented 8 years ago

Hi Jen, Upon investigating the resource XML (544.xml), these three taxa you mentioned from https://flic.kr/p/dQZ84N:

<dwc:ScientificName>Xanthornus flaviceps</dwc:ScientificName>
<dwc:ScientificName>Xanthopsar flavus</dwc:ScientificName>
<dwc:ScientificName>Xanthornus flavus</dwc:ScientificName>

…all have that image as dataObject:

<dataObject>
  <dc:identifier>8430618176</dc:identifier>
  <dataType>http://purl.org/dc/dcmitype/StillImage</dataType>
  <mimeType>image/jpeg</mimeType>
  <agent homepage="http://www.flickr.com/photos/61021753@N02" role="photographer">Biodiversity Heritage Library</agent>
  <dcterms:created>2013-01-30 12:35:51</dcterms:created>
  <dc:title>n295_w1150</dc:title>
  <dc:language>en</dc:language>
  <license>http://creativecommons.org/licenses/by/2.0/</license>
  <dcterms:rightsHolder>Biodiversity Heritage Library</dcterms:rightsHolder>
  <dc:source>https://www.flickr.com/photos/biodivlibrary/8430618176/</dc:source>
  <dc:description>The zoology of the voyage of H.M.S. Beagle .... London,Smith, Elder &amp; Co.,1838-. <a href="http://biodiversitylibrary.org/page/14062609" rel="nofollow">biodiversitylibrary.org/page/14062609</a></dc:description>
  <mediaURL>https://farm9.staticflickr.com/8475/8430618176_01939cd553_o.jpg</mediaURL>
</dataObject>

So this image should be attached to those three taxa. I’m running the connector locally and will give you a copy of latest. Will also upload one in our harvest machine just to be sure harvesting gets it. Will update this ticket.
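For anyone who wants to repeat this check, a rough sketch of listing the taxa that carry a given dataObject identifier might look like the following. The namespace URIs, the `response` root element, and the trimmed-down `SAMPLE` document are assumptions for illustration; the real 544.xml may declare them differently, so adjust `NS` to match the file's header.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the header of the real resource file.
NS = {
    "eol": "http://www.eol.org/transfer/content/0.3",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dwc": "http://rs.tdwg.org/dwc/dwcore/",
}

# Tiny stand-in for 544.xml, reduced to the elements the check needs.
SAMPLE = """<response xmlns="http://www.eol.org/transfer/content/0.3"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/">
  <taxon>
    <dwc:ScientificName>Xanthornus flaviceps</dwc:ScientificName>
    <dataObject><dc:identifier>8430618176</dc:identifier></dataObject>
  </taxon>
  <taxon>
    <dwc:ScientificName>Xanthopsar flavus</dwc:ScientificName>
    <dataObject><dc:identifier>8430618176</dc:identifier></dataObject>
  </taxon>
  <taxon>
    <dwc:ScientificName>Xanthornus flavus</dwc:ScientificName>
    <dataObject><dc:identifier>8430618176</dc:identifier></dataObject>
  </taxon>
</response>"""


def taxa_for_object(xml_text, object_id):
    """Return the scientific names of all taxa that list the given dataObject."""
    root = ET.fromstring(xml_text)
    names = []
    for taxon in root.findall("eol:taxon", NS):
        ids = [
            (el.text or "").strip()
            for el in taxon.findall("eol:dataObject/dc:identifier", NS)
        ]
        if object_id in ids:
            name = taxon.findtext("dwc:ScientificName", default="", namespaces=NS)
            names.append(name.strip())
    return names
```

Calling `taxa_for_object(SAMPLE, "8430618176")` on the stand-in document returns all three names, which is the situation described for the real resource.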

And regarding the proposed global names service to detect synonyms, that is doable if they have an API we can use. Anyway, I did some checking with our own API, just exploring. Will you be able to check if the combination of our Search and Pages API can act as a name service to detect synonyms? e.g. Xanthopsar flavus

  1. http://eol.org/api/search/1.0.xml?q=Xanthopsar+flavus&page=1&exact=false&filter_by_taxon_concept_id=&filter_by_hierarchy_entry_id=&filter_by_string=&cache_ttl=
     Then get taxon_concept_ID 686274 as input for the Pages API:
  2. http://eol.org/api/pages/1.0.xml?batch=false&id=686274&images_per_page=0&images_page=0&videos_per_page=0&videos_page=0&sounds_per_page=0&sounds_page=0&maps_per_page=0&maps_page=0&texts_per_page=0&texts_page=0&iucn=false&subjects=overview&licenses=all&details=false&common_names=false&synonyms=true&references=false&taxonomy=false&vetted=0&cache_ttl=&language=en
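The two-step lookup above can be sketched as a pair of URL builders. This only constructs the request URLs and makes no network calls; the helper names are hypothetical, and passing just a subset of the query parameters assumes the omitted ones have usable defaults, which the thread does not confirm. Parsing the responses is left out because the response layout is not shown here.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://eol.org/api/search/1.0.xml"
PAGES_URL = "http://eol.org/api/pages/1.0.xml"


def search_url(name):
    """Step 1: search for a name to obtain its taxon concept ID."""
    params = {"q": name, "page": 1, "exact": "false"}
    return SEARCH_URL + "?" + urlencode(params)


def pages_url(taxon_concept_id):
    """Step 2: request the page for that ID with synonyms switched on."""
    params = {
        "batch": "false",
        "id": taxon_concept_id,
        "details": "false",
        "common_names": "false",
        "synonyms": "true",
    }
    return PAGES_URL + "?" + urlencode(params)
```

For example, `search_url("Xanthopsar flavus")` yields the search URL with `q=Xanthopsar+flavus`, and `pages_url(686274)` yields the Pages request for the concept ID mentioned above.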

Thanks, Eli

eliagbayani commented 8 years ago

Here is the latest copy: https://dl.dropboxusercontent.com/u/7597512/BioImages/544.xml.gz I've also now updated the harvest machine with the latest copy of the resource and set the resource to force-harvest. http://eol.org/content_partners/134/resources/544 Please also note that the last harvest is seemingly in an unfinished state (24Feb2016).

jhammock commented 8 years ago

Thanks, Eli! I hope the incomplete last harvest explains the missing objects; my theory sounds de-bunked, anyway.

Your synonym search method looks good to me. Is adding the synonym relationships to the taxa file, filtering to include only those where both the synonym and the accepted name were already present, an option? Since you are finding, and reproducing, these synonym relationships without the benefit of any higher taxonomy, I'd like to try to add only the ones we need.

In the meantime, we'll see if a fresh harvest gives us better coverage of the images to their assorted taxa. Fingers crossed...

jhammock commented 8 years ago

Oh- or, if this is not much harder- could you use that synonym filter, per data object, to discard the copy of the data object on a synonym, if it's already there with the accepted name? Strictly speaking, that's the safest way to do it, but adding the synonym relationships would be a decent fallback.

eliagbayani commented 8 years ago

Hi @jhammock ,

So maybe what we can do is just the first one: ignore/remove the copy of the data object on a synonym, and only create the data object with the accepted name. But this also means that the correctness depends on the synonym detection tool we are going to use, which in this case will be our own EoL API (combination of Search and Pages API calls). Does this sound okay? Thanks.

jhammock commented 8 years ago

Yes, detecting the synonyms with our own services and creating data objects only for the accepted name if both are present sounds good. Adding the synonym relationship is riskier and doesn't add any information we didn't already have. Thanks!
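A minimal sketch of the rule agreed on here, per image: drop the copy on a synonym only when its accepted name is also tagged on the same image. The function name and the `accepted_name_of` mapping are hypothetical; in practice the mapping would come from the Search and Pages API lookups, and the names below are placeholders, not real taxa.

```python
def drop_synonym_copies(tagged_names, accepted_name_of):
    """Keep one data object per concept for a single image.

    tagged_names: names tagged on the image, in tag order.
    accepted_name_of: hypothetical dict mapping a synonym to its
        accepted name, as reported by the name service.
    """
    present = set(tagged_names)
    kept = []
    for name in tagged_names:
        accepted = accepted_name_of.get(name)
        # Discard only when the accepted name is present on the same image.
        if accepted is not None and accepted != name and accepted in present:
            continue
        kept.append(name)
    return kept


# Placeholder names for illustration only.
lookup = {"Aus oldname": "Aus newname"}
both_present = drop_synonym_copies(["Aus oldname", "Aus newname"], lookup)
synonym_only = drop_synonym_copies(["Aus oldname"], lookup)
```

Here `both_present` keeps only the accepted name, while `synonym_only` keeps the synonym because there is no accepted-name copy to fall back on.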

eliagbayani commented 8 years ago

Hi @jhammock , I now have a resource where we’ve only assigned images to accepted names and not to synonyms. Should I now proceed to set this to force-harvest? Thanks.

jhammock commented 8 years ago

Cool! Just uploaded is good for now, thanks. We are pausing harvest while we do some names clean up, so we'll try this in a couple of weeks. I'll annotate it in the queue so we won't forget it's ready to try again.

jhammock commented 8 years ago

This is harvested at last! Multiple taxa in plates are showing up nicely. I have one example of a synonym that was not discarded, but I'm not sure if it was because of insufficient information: http://eol.org/data_objects/32776505 created one new taxon, http://eol.org/pages/45520365, although a search for that name on EOL turns up two taxon concepts for which it's an alternative name. (Evidently we had some merging to do already...)

eliagbayani commented 8 years ago

Hi @jhammock , Our script detected (via the API) that there is a “Falco chrysaetos” page in eol.org. That is, Species recognized by Inventaire National du Patrimoine Naturel

So it is not a synonym. Thus our system should have attached the image to that page and not created a new page.

Yes, a merging process should fix it. Thanks, Eli

jhammock commented 8 years ago

I see. It looks like any remaining messiness with this resource results from insufficient ancestry. A small sample suggests to me that there are lots of single taxon images that could be ameliorated this way. Sanity check: if the species name is known to EOL only as a synonym, adding some ancestry to it will still help, yes?

eg: binomial:"Fucus nodosus" (synonym) + order:Fucales + class:Phaeophyceae

That would have a better chance of mapping than binomial:"Fucus nodosus" alone?

If so, I will encourage the BHL folk accordingly. Thanks!

eliagbayani commented 8 years ago

Hi @jhammock, In your example:

eg: binomial:"Fucus nodosus" (synonym) + order:Fucales + class:Phaeophyceae

Right now, the image will still NOT be assigned to the binomial (synonym), but it will be assigned to the order:Fucales, because the connector/script NOW will not assign objects to a name detected as a synonym by our API. Meaning, detected synonyms are ignored. In your example we will have a <taxon> entry like this:

  - order = Fucales
  - class = Phaeophyceae

If we remove that code and do not check whether a name is a synonym, we will have a <taxon> entry with this info:

  - scientificname/binomial = Fucus nodosus
  - order = Fucales
  - class = Phaeophyceae

Then I would assume that our harvesting should create a new page for it in EOL if we DON’T have a “Fucus nodosus” page yet. And if we already have one, the image should be assigned to that existing EOL page rather than to a new page. Here, the added ancestry info is most helpful.

*In the case of “Falco chrysaetos”, where we already have a page for it, even if there is NO other ancestry info included in the tags, our system (I assume) should have assigned the image to the existing EOL page and not created a new one.
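The difference between the two <taxon> entries described above can be sketched as follows. This is an illustration under assumptions, not the connector's code: `build_taxon_entry`, the `check_synonyms` switch, and the `is_synonym` callback (standing in for the API lookup) are all hypothetical names.

```python
def build_taxon_entry(tags, is_synonym, check_synonyms=True):
    """Build the fields of a <taxon> entry from rank -> name tags.

    With check_synonyms on, a binomial that the name service reports
    as a synonym is left out, so the image lands on the remaining ranks.
    """
    entry = {}
    for rank, name in tags.items():
        if rank == "binomial" and check_synonyms and is_synonym(name):
            continue  # detected synonyms are ignored
        entry[rank] = name
    return entry


tags = {"binomial": "Fucus nodosus", "order": "Fucales", "class": "Phaeophyceae"}


def treated_as_synonym(name):
    """Stand-in for the API check, mirroring the example in this thread."""
    return name == "Fucus nodosus"


with_filter = build_taxon_entry(tags, treated_as_synonym)
without_filter = build_taxon_entry(tags, treated_as_synonym, check_synonyms=False)
```

`with_filter` contains only the order and class, while `without_filter` also carries the binomial, matching the two entries listed above.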

Thanks Eli

jhammock commented 8 years ago

Ah, ok. I think I just learned something. The filter checks for any name recognized as a synonym on EOL, not as a synonym of another name in the BHL resource?

I think that's still fine. I will encourage the ambitious BHL folk, if they want to help mapping, to go ahead and use the synonym, look up and add the accepted name also, and maybe a couple of ranks of ancestry- in the one organism in the image case, not the multiple organism case. Does that sound like it won't break anything?

eliagbayani commented 8 years ago

Yes @jhammock , that will work. Let them add not just the synonym but also look up and add the accepted name. And yes, if available, add a couple of ranks of ancestry. This will improve coverage. Thanks!