Original comment by: hlapp
See https://www.nescent.org/wg_dryad/TreeBASE_OAI_Provider for examples of URLs that return these OAI records.
The record schema is at http://datadryad.org/profile/v3/dryad.xsd. Both data formats mentioned above formally conform to the schema, but the best practice is to have several <dc:subject> elements, one per term.
Original comment by: vgapeyev-nescent
A few URLs that return records exhibiting the problem:
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s1908
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s10013
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s1122
http://127.0.0.1:8080/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s994
Note that some records separate keywords with ',' while others use ';'.
Original comment by: vgapeyev-nescent
Thanks for reporting this bug. We'll look into it as soon as possible.
Original comment by: vgapeyev-nescent
This is a request for clarification.
The TreeBASE UI offers a single field for entering keywords, and its text is stored in a single field in the database. From the data in treebase-dev, I see that users have used either ',' or ';' to separate multiple keywords.
Here is what I can do: take Kevin's keyword-splitting code and place it on the TreeBASE side, modifying it if necessary to work with both ';' and ','. This would not work well if a user chooses to enter comma-containing keywords separated by semicolons, or the other way around.
Please confirm that this is what is needed.
Original comment by: vgapeyev-nescent
This was my concern as well with my workaround Dryad code -- that there may be repositories for whom the comma is significant and not a delimiter. It seems that if TreeBASE wants to store all these in one field it might be good to prescribe that users use a semicolon as a delimiter (perhaps doing a db cleanup on records that are currently using a comma). Then the OAI code could rely on the semicolon as the split to break the string into separate metadata elements for output via OAI-PMH.
My code for this was very minimal, just using a StringTokenizer(value, ";,"); cf. line 785 in http://code.google.com/p/dryad/source/browse/trunk/dryad/dspace/modules/api/src/main/java/org/dspace/harvest/OAIHarvester.java
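For illustration, here is a minimal sketch of that kind of splitting. It is not the actual OAIHarvester code; the class and method names (KeywordSplitter, split) are hypothetical, and the trimming and empty-token handling are assumptions about what a harvester would want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not taken from the Dryad or TreeBASE code base.
class KeywordSplitter {

    // Splits a free-text keyword field on ';' and ',' and trims each piece.
    // Every character in the delimiter string ";," acts as a separator.
    static List<String> split(String value) {
        List<String> keywords = new ArrayList<>();
        if (value == null) {
            return keywords;
        }
        StringTokenizer tokens = new StringTokenizer(value, ";,");
        while (tokens.hasMoreTokens()) {
            String keyword = tokens.nextToken().trim();
            // Skip whitespace-only pieces such as the gap in "a; ,b".
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // prints [Ascomycota, lichens, Tree of Life]
        System.out.println(split("Ascomycota, lichens; Tree of Life"));
    }
}
```

Because both characters are treated as delimiters, a keyword that legitimately contains a comma would be broken apart, which is exactly the concern raised above.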
Original comment by: ksclarke
Your bug has been resolved. Thanks for the report.
Original comment by: vgapeyev-nescent
Fixed in SVN 760: the TreeBASE citation.keyword field is now split on both ',' and ';', with each resulting keyword going into its own <dc:subject> element.
'in press' values will show up as <dc:subject>in press</dc:subject> -- this is awaiting Bill's data cleaning on production.
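The behaviour described in this fix can be sketched roughly as below. This is not the code committed in SVN 760; the class and method names (DcSubjectWriter, toSubjectElements, escapeXml) are hypothetical, and the trimming, empty-keyword skipping, and XML escaping are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual TreeBASE OAI provider code.
class DcSubjectWriter {

    // Escapes the characters that are unsafe in XML element content.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Turns one free-text keyword field into one <dc:subject> element per keyword.
    static List<String> toSubjectElements(String keywordField) {
        List<String> elements = new ArrayList<>();
        if (keywordField == null) {
            return elements;
        }
        // The character class [,;] splits on either delimiter.
        for (String piece : keywordField.split("[,;]")) {
            String keyword = piece.trim();
            if (!keyword.isEmpty()) {
                elements.add("<dc:subject>" + escapeXml(keyword) + "</dc:subject>");
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (String element : toSubjectElements("fungal evolution, lichens; in press")) {
            System.out.println(element);
        }
    }
}
```

Note that a stray value such as 'in press' passes through unchanged as its own element, since the splitter has no way to tell it apart from a real keyword.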
Original comment by: vgapeyev-nescent
In the OAI records, each <dc:subject> field contains many keywords, separated by commas, like this:
<dc:subject> Ascomycota, Pezizomycotina, Dothideomyceta, fungal evolution, lichens, multigene phylogeny, phylogenomics, plant pathogens, saprobes, Tree of Life </dc:subject>
It is best practice to put each keyword into a separate <dc:subject> field. This allows harvesting systems (like Dryad) to accurately separate the keywords, and not worry about keywords that may contain commas.
Reported by: ryscher