TreeBASE / treebase

Source code for TreeBASE web application and database
http://www.treebase.org
BSD 3-Clause "New" or "Revised" License

OAI records contain all subjects in a single field #199

Closed by ghost 8 years ago

ghost commented 14 years ago

In the OAI records, each <dc:subject> field contains many keywords, separated by commas, like this:

<dc:subject> Ascomycota, Pezizomycotina, Dothideomyceta, fungal evolution, lichens, multigene phylogeny, phylogenomics, plant pathogens, saprobes, Tree of Life </dc:subject>

It is best practice to put each keyword into a separate <dc:subject> field. This allows harvesting systems (like Dryad) to accurately separate the keywords, and not worry about keywords that may contain commas.

Reported by: ryscher

ghost commented 14 years ago

Original comment by: hlapp

ghost commented 14 years ago

See https://www.nescent.org/wg_dryad/TreeBASE_OAI_Provider for examples of URLs that return these OAI records.

The record schema is at http://datadryad.org/profile/v3/dryad.xsd. Both data formats mentioned above formally conform to the schema, but the best practice is to have several <dc:subject> elements, one per term.

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

A few URLs that return records exhibiting the problem:

http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s1908
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s10013
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s1122
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s994

Note that some records separate keywords with ',' while others use ';'.

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

Thanks for reporting this bug. We'll look into it as soon as possible.

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

This is a request for clarification.

The TreeBASE UI offers a single field for entering keywords, and that text is stored in a single field in the database. From the data in treebase-dev, I see that users have separated multiple keywords with either ',' or ';'.

Here is what I can do: take Kevin's keyword-splitting code and place it on the TreeBASE side, modifying it if necessary to work with both ';' and ','. This would not work well if a user chooses to separate comma-containing keywords with semicolons, or the other way around.
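To make the limitation concrete, here is a minimal sketch of splitting on both delimiters. The class and method names (KeywordSplitDemo, splitOnBoth) are hypothetical and are not taken from the TreeBASE or Dryad code; the second sample string is an invented keyword field chosen only to show the failure case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual TreeBASE code.
class KeywordSplitDemo {

    // Treats both ',' and ';' as delimiters, trims each piece,
    // and drops pieces that are empty after trimming.
    static List<String> splitOnBoth(String raw) {
        List<String> keywords = new ArrayList<>();
        if (raw == null) {
            return keywords;
        }
        for (String part : raw.split("[;,]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                keywords.add(trimmed);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Plain delimiter use: three keywords come back as intended.
        System.out.println(splitOnBoth("Ascomycota, lichens; Tree of Life"));

        // Invented example of a comma-containing keyword separated by a semicolon:
        // the user meant two keywords, but the split yields three.
        System.out.println(splitOnBoth("Fungi, lichenized; multigene phylogeny"));
    }
}
```

In the second call the intended keyword "Fungi, lichenized" is broken into "Fungi" and "lichenized", which is exactly the case that cannot be resolved from the stored string alone.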

Please confirm that this is what is needed.

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

This was my concern as well with my workaround Dryad code -- that there may be repositories for whom the comma is significant and not a delimiter. It seems that if TreeBASE wants to store all these in one field it might be good to prescribe that users use a semicolon as a delimiter (perhaps doing a db cleanup on records that are currently using a comma). Then the OAI code could rely on the semicolon as the split to break the string into separate metadata elements for output via OAI-PMH.

My code for this was very minimal, just using a StringTokenizer(value, ";,"); cf. line 785 in http://code.google.com/p/dryad/source/browse/trunk/dryad/dspace/modules/api/src/main/java/org/dspace/harvest/OAIHarvester.java
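For readers unfamiliar with the class, a minimal sketch of that kind of tokenizing follows. It is not a copy of the OAIHarvester code; the class and method names (TokenizerSketch, tokenize) and the trimming step are assumptions added for illustration. Only java.util.StringTokenizer and the ";," delimiter string come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch; the real harvester code may differ.
class TokenizerSketch {

    static List<String> tokenize(String value) {
        List<String> subjects = new ArrayList<>();
        if (value == null) {
            return subjects;
        }
        // Every character in ";," is treated as a delimiter.
        // StringTokenizer never returns empty tokens, so runs such as ",;"
        // collapse into a single separator.
        StringTokenizer tokens = new StringTokenizer(value, ";,");
        while (tokens.hasMoreTokens()) {
            // Tokens keep their surrounding whitespace, so trim them
            // (the trimming is an assumption, not something the thread states).
            String token = tokens.nextToken().trim();
            if (!token.isEmpty()) {
                subjects.add(token);
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("fungal evolution, lichens,; saprobes"));
    }
}
```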

Original comment by: ksclarke

ghost commented 14 years ago

Your bug has been resolved. Thanks for the report.

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

Fixed in SVN 760: the TreeBASE citation.keyword field is now split on both ',' and ';', with the results going into separate <dc:subject> elements.

'in press' values will show up as <dc:subject>in press</dc:subject> -- this is awaiting Bill's data cleaning on production.
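The behavior described above could look roughly like the following sketch. This is not the code committed in SVN 760; the class and method names (SubjectElements, toSubjectElements, escapeXml) and the XML escaping step are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning one stored keyword string into
// one <dc:subject> element per keyword.
class SubjectElements {

    // Minimal escaping for element text; '&' must be replaced first.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    static List<String> toSubjectElements(String keywordField) {
        List<String> elements = new ArrayList<>();
        if (keywordField == null) {
            return elements;
        }
        for (String part : keywordField.split("[;,]")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                elements.add("<dc:subject>" + escapeXml(keyword) + "</dc:subject>");
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (String element : toSubjectElements("Ascomycota, lichens; Tree of Life")) {
            System.out.println(element);
        }
        // A field holding only "in press" passes through as a single subject.
        System.out.println(toSubjectElements("in press"));
    }
}
```

A value with no delimiter, such as "in press", comes out as a single element, which matches the note above about those values awaiting data cleaning.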

Original comment by: vgapeyev-nescent

ghost commented 14 years ago

Original comment by: vgapeyev-nescent