Open andrewhead opened 3 years ago
@andrewhead just want to make sure I follow the plan...
Is that right?
If so, and related your question about analogous code changes in scholarphi-pipeline, what code is it that you would like scholarphi-pipeline to be calling? Right now, it works off the chi-2021-demo branch. We could have it continue to do so after 1. (I think the one change needed on the scholarphi-pipeline side for this, apart from bumping the image, would be to stop running 2% of papers through expecting them to get more than citations). We could also have it run off the chi-2021-demo branch after 2. At that point perhaps we could reintroduce the part where 2% of papers try to get more than just citations. (Both these change descriptions assume that the way we call scholarphi to get just citations output doesn't change.) If we wanted to run scholarphi-pipeline off main, we should probably figure out what other changes that exist in chi-2021-demo we want to bring in too.
@andrewhead just want to make sure I follow the plan... Is that right?
Yes, that's right! Though I was thinking of first merging both steps 1 and 2 into an experimental branch (e.g., "chi-2021-demo-refactor") one at a time before merging that branch into "chi-2021-demo".
That way, we would not have to turn off the proccessing of the 2% of papers with more than citations, but rather modify the way that symbol and definition commands get called from 'scholarphi-pipeline,' once the merge of the experimental branch into "chi-2021-demo" was ready.
what code is it that you would like scholarphi-pipeline to be calling?
I am thinking we could initially have it call the "chi-2021-demo" code, then once we successfully merge into "main," we could have scholarphi-pipeline run off of "main." What do you think of that idea?
Yes, that's right! Though I was thinking of first merging both steps 1 and 2 into an experimental branch (e.g., "chi-2021-demo-refactor") one at a time before merging that branch into "chi-2021-demo". That way, we would not have to turn off the proccessing of the 2% of papers with more than citations, but rather modify the way that symbol and definition commands get called from 'scholarphi-pipeline,' once the merge of the experimental branch into "chi-2021-demo" was ready.
That sounds good.
I am thinking we could initially have it call the "chi-2021-demo" code, then once we successfully merge into "main," we could have scholarphi-pipeline run off of "main." What do you think of that idea?
I think it would be great to start running off main again! Do you think we would bring over all of the changes that exist in chi-2021-demo at that point, or do you think we would have to be at all selective? Asking because I'm remembering a talk I had with @kyleclo after the CHI demo where we weren't sure what exactly should get pulled into main from that branch... Though perhaps the changes you're making with this and the next PR would address that? Adding a note here to reflect what I think was a helpful piece of that conversation: we don't have to necessarily deploy anything that gets merged to scholarphi's main and affects data-processing to scholarphi-pipeline - we can limit it to certain commits (possibly tagged as releases).
Do you think we would bring over all of the changes that exist in chi-2021-demo at that point, or do you think we would have to be at all selective?
Hm... I am not sure yet. Honestly, I cannot remember all of the changes between the 'main' branch and 'chi-2021-demo' 😬 I think we will want to roll in all of the changes from the data-processing
directory, though I think the ui
and api
directories should also be checked to see if any code is present in those branches that we do not want to be deployed.
Adding a note here to reflect what I think was a helpful piece of that conversation: we don't have to necessarily deploy anything that gets merged to scholarphi's main and affects data-processing to scholarphi-pipeline - we can limit it to certain commits (possibly tagged as releases).
This sounds like a good idea to me! I can imagine this making the deployed pipeline less prone to error because it would require one of us to explicitly indicate in the code which versions of the code had been evaluated as capable of running correctly. Is that the idea?
(Andrew's personal notes for upcoming tasks as part of this refactor:)
--commands
, --start
, and --end
I guess I'm wondering if it's the case that the format of the same entity with the new code will be different enough that the existing types wouldn't be particularly useful anymore?
This is a great point! When I could, I left in the instances of types that I think will be useful to the new code. There were a bunch of types that did not make the cut, as they no longer have an analog in the new code (for instance, entities that have LaTeX 'start' and 'end' character offsets, where the new code will be mostly agnostic to LateX entity positions).
I still think that bumping the image that scholarphi-pipeline uses will probably be a more or less manual thing for the foreseeable future...
This sounds fine to me!
And just to be clear, I'm talking about changes in data-processing making it into the scholarphi-pipeline environment, not api or. UI changes making it into the relevant Skiff environments
Thank you for the clarification! I think we are on the same page there.
This refactor is part of a larger effort to simplify the paper processing code. An alternative (faster, higher-accuracy) pipeline has been developed for detecting sentences, symbols, and definitions. This PR removes the old code for detecting these entities, to make it clear that this original code for detecting those entities is no longer maintained, and to make the remaining code much smaller, and hopefully therefore easier read and maintain.
A summary of changes follows:
entities
directories for terms, sentences, sentences-pdf, definitions, and glossary_terms (this accounts for the majority of deleted files)common.types
that solely existed as data structures for the above entitiescommon.file_utils
EquationExtractor
, etc.)tests/mathml-fragments/*
)Following these changes, the full complement of unit tests still pass:
After running
pytest .
, the following appears:I also re-ran extraction of citations to ensure that it still runs:
This is the contents of the generated
output/citations.jsonl
file:Before this branch is merged into 'main', PRs will be submitted to incorporate the new code for detecting sentences, symbols, and definitions. An initial version of this code is implemented in the repository https://github.com/allenai/pdftexalign.
This PR should be discussed with @ca16 to discuss places where changes in this code base will require analogous changes in the scholarphi pipeline container code.