Open golnazads opened 1 month ago
Isn't that how things work currently? For reference files with a default extension, the extension determines what parser to use. In all other cases, the journal and volume determine what parser to use. Individual references do not carry the required information to determine what parser to use.
Yes, I added this issue because there was an ongoing discussion about whether errors in reference files should be corrected directly in the file or within the pipeline, and I wasn't sure if a conclusion had been reached. My preference has always been to fix the small number of problematic reference files manually, because I combined the parsing logic of various formats into a unified parser to minimize the number of parsers and make maintenance easier. For example, some text reference files format multi-line references with a tab at the beginning of every line, while others start the first line at the left margin and use a tab only for the continuation lines. A single parser can only handle one of these formats. I implemented the format that correctly parses the majority of reference files and anticipated manually fixing the few outliers.
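To illustrate why one parser cannot serve both tab conventions, here is a minimal sketch. It assumes the continuation-line convention (first line at the margin, tab on continuation lines) is the one being parsed; the thread does not say which convention the pipeline actually implements, and `join_hanging_indent` is a hypothetical helper, not pipeline code.

```python
def join_hanging_indent(lines):
    """Join multi-line references where continuation lines start with a tab.

    Assumption for illustration: a line that starts at the margin begins a
    new reference, and a line that starts with a tab continues the previous one.
    """
    refs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t") and refs:
            refs[-1] += " " + line.strip()  # continuation of the previous reference
        else:
            refs.append(line.strip())  # start of a new reference
    return refs


hanging = ["Smith, J. 1999,", "\tApJ, 500, 1", "Jones, A. 2001,", "\tMNRAS, 320, 45"]
all_tabs = ["\t" + line.lstrip("\t") for line in hanging]

print(join_hanging_indent(hanging))   # two references, as intended
print(join_hanging_indent(all_tabs))  # everything collapses into one reference
```

Under this rule a file that puts a tab on every line collapses into a single reference, which is the kind of outlier that would need a manual fix.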
There was a suggestion to add more parsers, but I am not in favor of this approach as it would increase complexity and maintenance overhead—issues that the classic system already struggles with and that we want to avoid here.
@ehenneken I am including two unresolved issues that were discussed during the meeting for further review and action. Both relate to nested reference strings. The first issue involves references separated by semicolons. As detailed in the Feedback Document - Reference Pipeline, the pipeline cannot use the semicolon to break up references because, in some instances, it separates the title and journal within a single reference string. Therefore, I have refrained from splitting references on semicolons. I have documented all the instances where manual correction is required; there are not that many. The second issue concerns nested author replacement. Multiple underscores or hyphens indicate author substitution for multiple references in one line. The pipeline performs the substitution only if the year follows the author list in the first reference string and follows the underscores or hyphens in each subsequent reference string. If the year is not present, the pipeline is unable to replace the authors because it lacks the necessary anchor for substitution. @aaccomazzi
What are the journals where we see semicolons separating titles and journals? The formats may be sufficiently different to allow some branching for the logic associated with reference processing. For instance, the semicolons separating multiple references are common in APS (physics) journals, but not in astronomy.
I don’t remember, and unfortunately, I did not document it while working on the pipeline. I only knew they existed, likely from arXiv, because I had to ensure that the reference service could parse them correctly and separate the title from the journal when tokenizing. I primarily worked with arXiv while implementing the reference service.
I would rather not go down this route and create another parser, as my understanding, along with the documented instances, is that there aren’t that many cases, and they can be fixed manually. However, if you insist on having multiple parsers and returning to the way things were done in classic, please provide the bibstems of the files to redirect to a new parser and accept semicolons as multi-references.
Please see error #2b in the text parsers verification report for a specific example of the semicolon issue.
Thanks, the report shows, as you say, that there are instances where semicolons are improperly used (they should be commas). I don't have the data to back up the following statement, but this is my current guess: the examples you have shown in the Google doc are the outliers that need to be fixed by hand, because most physics journals tend to consistently use semicolons to separate references. The data "fix" should therefore be to edit those references where the semicolon was really supposed to be a comma, rather than the other way around.
If I'm correct, an additional non-manual solution may be possible: since we know the source journal for a reference (we already use it to select the proper handler), we could "turn on" semicolon-splitting based on it. The logic behind this is that for the major physics journals we know that semicolons are used to separate references, whereas in all other cases they are not.
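A rough sketch of what journal-based switching could look like. The thread does not list the bibstems, so the entries below are placeholders, and `split_references` is a hypothetical function name rather than anything in the pipeline.

```python
# Placeholder bibstems: the thread does not name the journals, so these
# values are hypothetical and would have to be supplied by the curators.
SEMICOLON_SPLIT_BIBSTEMS = {"EXAMPLE_PHYS_A", "EXAMPLE_PHYS_B"}


def split_references(refstr, bibstem):
    """Split on semicolons only for journals known to use them as separators."""
    if bibstem in SEMICOLON_SPLIT_BIBSTEMS:
        return [part.strip() for part in refstr.split(";") if part.strip()]
    # For every other journal, leave the string intact: the semicolon may
    # separate a title from a journal inside one reference.
    return [refstr.strip()]


line = "Smith, J. 1999, ApJ, 500, 1; Jones, A. 2001, MNRAS, 320, 45"
print(split_references(line, "EXAMPLE_PHYS_A"))  # split into two references
print(split_references(line, "OTHER"))           # kept as a single string
```

The trade-off is that any reference in a listed journal that misuses a semicolon would still be split incorrectly and need a manual fix.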
This has been on my mind since I reviewed my notes yesterday. I had written:
That is the reason I have marked these to be separated manually. Or, once I have tabulated them all, code could be implemented to distinguish between them.
So, I decided to follow my own advice, lol. To implement it, I would rather not go the journal-based route, but instead try to determine which semicolon separates a title/journal and which one separates two references. One approach is to break the reference string down and identify the author list and the year in all segments using regular expressions. If each segment contains both an author list and a year, then they are individual references. It won’t be 100%, but I think it will cover 97% of cases.
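As a very rough sketch of that idea: split on semicolons, then accept the split only if every segment contains something that looks like an author and a year. The regular expressions here are simplified assumptions for illustration and are not the patterns used in the pipeline; the function names are hypothetical.

```python
import re

# Assumed patterns, for illustration only.
# Year: a four-digit year from 1800-2099, optionally followed by a letter (2004a).
YEAR_RE = re.compile(r"\b(?:1[89]|20)\d{2}[a-z]?\b")
# Author: "Smith, J." / "Smith J." or "J. Smith".
AUTHOR_RE = re.compile(
    r"[A-Z][A-Za-z'\-]+,?\s+(?:[A-Z]\.\s?)+"
    r"|(?:[A-Z]\.\s?)+\s?[A-Z][A-Za-z'\-]+"
)


def looks_like_reference(segment):
    """True if the segment contains both an author-like and a year-like token."""
    return bool(AUTHOR_RE.search(segment)) and bool(YEAR_RE.search(segment))


def split_on_semicolons(refstr):
    """Split only when every semicolon-delimited segment looks like a reference."""
    segments = [s.strip() for s in refstr.split(";") if s.strip()]
    if len(segments) > 1 and all(looks_like_reference(s) for s in segments):
        return segments
    return [refstr.strip()]


multi = "Smith, J. 1999, ApJ, 500, 1; Jones, A. B. 2001, MNRAS, 320, 45"
single = "Smith, J. 1999, The structure of galaxies; Astrophysical Journal, 500, 1"

print(split_on_semicolons(multi))   # two references
print(split_on_semicolons(single))  # left as one reference
```

A check like this will misfire on some inputs, for example a page number that happens to look like a year, so the remaining cases would still need manual correction.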
I shall try to find more instances and experiment with this approach next week after finishing the task I am currently working on.
The import of references from classic files into the DB should just focus on the file parsing details and be decoupled from the individual reference format.