I opened this issue to start work here @jmshapir @miikapaal.
I think that the final version should look like a set of python scripts that go from PDF versions to sentence and paragraph level datasets in csv format. Do you see any other important decisions we need to make in light of this goal?
Also, could I have write access @jmshapir? I'll assign myself here and start working on a new branch soon.
> I opened this issue to start work here @jmshapir @miikapaal.

Thanks @santiagohermo!

> I think that the final version should look like a set of python scripts that go from PDF versions to sentence and paragraph level datasets in csv format.

And I guess also counts at the level of the Laroplan-word (i.e., for each Laroplan, a count of each word that appears in it)?

> Also, could I have write access @jmshapir? I'll assign myself here and start working on a new branch soon.

Done!
I wanted to flag that I had some problems related to `git lfs` when cloning the repo in GitHub Desktop @jmshapir. The problem was related to a specific file, so I tried deleting it from `origin/main` (commit https://github.com/JMSLab/LaroplanOCR/commit/2e5061dd867d695de9c8bb853c7bdd83e6b6a8ae) and then cloning again. This didn't solve the issue, and the same error happened on another file transferred via lfs. The log for this second error is 20220120T153846.1084247.log. My guess is that files in JMSLab/Template are not added to lfs, and then when cloning they cannot be downloaded. I wonder whether this is related to https://github.com/JMSLab/Template/issues/43.
UPDATE: I managed to clone the repo using the procedure in https://github.com/git-lfs/git-lfs/issues/911#issuecomment-169998792, but this didn't help with getting the binaries. I will delete them anyway, so I think it's ok. New binaries will be committed from scratch and thus should work fine.
For completeness, I note that when trying to get the binaries following the procedure in https://github.com/git-lfs/git-lfs/issues/911#issuecomment-169998792 I get the following message:
Note: I created branch `issue1_firstversion`.
On the structure @jmshapir @miikapaal @dagese.
In `Skills`, the structure of the laroplan analysis is as follows:
```
.
├── drive/
│   ├── raw_large/Laroplan/orig/  # Raw files
│   └── output/                   # Stores output
├── source/
│   ├── derived_large/laroplan/
│   │   └── make_images.py        # Transforms PDF to one PNG per page
│   ├── derived/laroplan/
│   │   ├── ocr.py                # OCRs each PNG separately and outputs TXT
│   │   ├── run_ocr.py            # Runs make_images.py and ocr.py
│   │   └── prepare.py            # Makes paragraph and sentence level datasets (run via SCons)
│   └── analysis/laroplan/
│       ├── keywords.csv          # Contains fluid and crystallized keywords
│       ├── bad_hits.csv          # Contains cases of usage to be dropped
│       ├── analyze.py            # Runs main analysis
│       ├── translate.py          # Translates results
│       └── ...                   # Plotting scripts
└── output/                       # Stores output
```
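To make the PDF-to-text steps concrete, here is a minimal sketch of what the `make_images.py` and `ocr.py` steps could look like. It assumes the `pdf2image` and `pytesseract` packages and tesseract's Swedish model (`swe`); the function names and parameters are illustrative, not the actual `Skills` code:

```python
# Sketch of the PDF -> PNG -> TXT steps (assumes pdf2image and pytesseract)
from pathlib import Path

from pdf2image import convert_from_path
from PIL import Image
import pytesseract


def pdf_to_pngs(pdf_path, out_dir):
    """make_images.py-style step: save one PNG per page of the PDF."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=300)
    for i, page in enumerate(pages, start=1):
        page.save(out_dir / f"page_{i:03d}.png")


def ocr_pngs(png_dir, txt_path, lang="swe"):
    """ocr.py-style step: OCR each PNG and concatenate into one TXT file."""
    pngs = sorted(Path(png_dir).glob("*.png"))
    text = "\n\n".join(
        pytesseract.image_to_string(Image.open(p), lang=lang) for p in pngs
    )
    Path(txt_path).write_text(text, encoding="utf-8")
```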
Assuming we don't use a datastore, I propose we organize this new repo as follows:
```
.
├── raw_data/
│   ├── docs/               # Documentation of files
│   └── orig/               # Original pdf files
├── derived/
│   ├── code/
│   │   ├── make_images.py  # Similar to version in Skills
│   │   ├── ocr.py          # Similar to version in Skills
│   │   └── clean.py        # Portion of prepare.py that cleans text file
│   └── output/
│       ├── images/
│       │   └── .gitignore  # Ignores png files
│       └── text/
└── analysis/
    ├── code/
    │   ├── make_data.py    # Portion of prepare.py that makes csv datasets
    │   ├── analyze.py      # Main analysis of fluid vs. crystallized
    │   └── translate.py    # Optional, translate output
    ├── output/
    │   └── ...
    ├── keywords.csv
    └── bad_hits.csv
```
If we use a datastore, maybe we can keep both the pdfs and the images there, together with a script that makes images out of the pdfs and some instructions on how to run it.
What do you think @jmshapir @miikapaal @dagese?
On how to run scripts: I think SCons is a bit too complicated for a simple public repo. I see several alternatives:

- A `run.py` script that runs everything, with a structure similar to the `runall.do` scripts in `Skills` (a sketch follows this list).
- A `make` file at the root of the repo.

Any other ideas?
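For the first option, a `run.py` could simply call each pipeline step in order. A minimal sketch, where the script names follow the proposed tree above and everything else is illustrative:

```python
# run.py sketch: run each pipeline step in order (paths follow the
# proposed tree and may change)
import subprocess
import sys

STEPS = [
    "derived/code/make_images.py",
    "derived/code/ocr.py",
    "derived/code/clean.py",
    "analysis/code/make_data.py",
    "analysis/code/analyze.py",
]


def main():
    for step in STEPS:
        print(f"Running {step}...")
        # Fail fast: check=True stops the pipeline if any step errors out
        subprocess.run([sys.executable, step], check=True)


if __name__ == "__main__":
    main()
```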
@santiagohermo thanks!
Regarding the git-lfs issues, it doesn't surprise me that when I spawned the new repo from the Template, git-lfs-tracked files were omitted, since these are treated differently than git-tracked files. (For example git-lfs files are, I think, omitted from the archive automatically created with each release.)
But, if this is preventing us from cloning new repositories that are based on the Template, that definitely seems like a bug we want to squash. Can you open an issue for yourself in the Template repo to figure out how we should approach this? Thanks!
(I will look next at the repository structure questions but there may be some delay before I can post again.)
Thanks @jmshapir! I opened https://github.com/JMSLab/Template/issues/45 for the cloning bug, we can discuss the issue there.
On the structure questions, that sounds good!
@santiagohermo thanks!
On the structure: I like the idea of a `run.py`. It seems simple and should be flexible enough to do what we want, while also being OS-flexible.

Thanks @jmshapir! I went through your comments, including the ones in the pdf. In response I revised the proposed structure for the repo, which you can see at the bottom of this comment.
Some thoughts on the pdf:

- `analyze.py` before was just counting words; I excluded the scripts that compute the cohort exposure and make the plots we use in the paper. To improve clarity I now renamed it to `count_keywords.py`.
- On `keywords.csv` and `bad_hits.csv`: these files are not output. The first one contains the keywords to be counted (moved to `analysis/code`), and the second one the appearances of keywords to be dropped (this file is now eliminated). I added output files to the tree, so that everything is clearer.

On your remarks in the comment: I agree with avoiding a datastore and with having a `run.py`, so I included them. And thanks for that reference to congress-legislators!
I'll start implementing and we can always adjust if you have more comments or we change our minds about something.
Suggested repository structure
```
LaroplanOCR/
├── run.py                        # Runs code
├── readme.md                     # Description of repo
├── raw/
│   ├── docs/                     # Documentation of files
│   ├── orig/                     # Original pdf files
│   └── readme.md                 # Instructions to get raw files
├── derived/
│   ├── code/
│   │   ├── make_images.py        # Similar to version in Skills
│   │   ├── ocr.py                # Similar to version in Skills
│   │   └── clean.py              # Portion of prepare.py that cleans text file
│   └── output/
│       ├── images/
│       │   └── .gitignore        # Ignores png files
│       └── text/
│           ├── lgrYYYY.txt       # Output of ocr.py, one file per lgr
│           └── lgrYYYY_clean.txt # Output of clean.py, one file per lgr
└── analysis/
    ├── code/
    │   ├── make_data.py          # Portion of prepare.py that makes csv datasets
    │   ├── count.py              # Counts all word appearances
    │   └── translate.py          # Optional, translate output
    └── output/
        ├── lgrYYYY_paragraphs.csv # Output of make_data.py, one file per lgr
        ├── lgrYYYY_sentences.csv  # Output of make_data.py, one file per lgr
        └── lgrYYYY_counts.csv     # Output of count.py, one file per lgr
```
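As a rough illustration of the `make_data.py` step, here is a minimal sketch that splits a cleaned text file into paragraph- and sentence-level CSVs. The segmentation rules (blank-line paragraphs, punctuation-based sentences) are placeholders; the real script will likely need more careful rules:

```python
# make_data.py sketch: build paragraph- and sentence-level CSVs from a
# cleaned TXT file (segmentation rules are illustrative)
import csv
import re
from pathlib import Path


def make_datasets(clean_txt, paragraphs_csv, sentences_csv):
    text = Path(clean_txt).read_text(encoding="utf-8")
    # Naive rule: paragraphs are separated by blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    with open(paragraphs_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["paragraph_id", "text"])
        writer.writerows(enumerate(paragraphs, start=1))

    with open(sentences_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["paragraph_id", "sentence_id", "text"])
        sentence_id = 0
        for paragraph_id, para in enumerate(paragraphs, start=1):
            # Naive rule: sentences end in ., ! or ? followed by whitespace
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent.strip():
                    sentence_id += 1
                    writer.writerow([paragraph_id, sentence_id, sent.strip()])
```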
Thanks @santiagohermo! The updated tree looks great.
A couple of questions:

- Can you explain the main differences between `lgrYYYY.txt` and `lgrYYYY_clean.txt`?
- My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a `keywords.csv`? And we could rename `count_keywords.py` --> `count.py`?

Thanks @jmshapir! Some replies to your questions.
> Can you explain the main differences between `lgrYYYY.txt` and `lgrYYYY_clean.txt`?

Yes. The files `lgrYYYY.txt` are the raw output of the OCR. The files `lgrYYYY_clean.txt` are the output of the OCR after some cleaning (dropping page numbering, putting paragraphs that cross pages together, etc.). The reason they are separate is that the OCR step is the lengthiest one in the pipeline (you can see runtimes from Skills here), so I think it's better to keep the cleaning separate, so that when we want to apply small changes to the cleaning step we don't have to run the OCR again.
Happy to try a different approach if you have something in mind.
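For illustration, the cleaning step could look something like the sketch below; the specific rules (the page-number regex, the hyphenation fix, the lowercase-continuation heuristic) are placeholders, not the actual `clean.py`:

```python
# clean.py sketch: clean raw OCR output (rules are illustrative)
import re
from pathlib import Path


def clean_ocr_text(raw_txt, clean_txt):
    lines = Path(raw_txt).read_text(encoding="utf-8").splitlines()
    # Drop lines that are just page numbers
    lines = [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Re-join words hyphenated across line breaks
    text = re.sub(r"-\n", "", text)
    # Join paragraphs that cross page/line breaks: a lowercase letter at
    # the start of a line suggests a continuation of the previous line
    text = re.sub(r"\n(?=[a-zåäö])", " ", text)
    Path(clean_txt).write_text(text, encoding="utf-8")
```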
> My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a `keywords.csv`? And we could rename `count_keywords.py` --> `count.py`?

So this output would have three columns, `laroplan`, `word`, and `n`, right? Sounds much easier than what we do in Skills, so I like it! I updated the tree in https://github.com/JMSLab/LaroplanOCR/issues/1#issuecomment-1018890028 directly.
Thanks @santiagohermo!

> > Can you explain the main differences between `lgrYYYY.txt` and `lgrYYYY_clean.txt`?
>
> Yes. The files `lgrYYYY.txt` are the raw output of the OCR. The files `lgrYYYY_clean.txt` are the output of the OCR after some cleaning (dropping page numbering, putting paragraphs that cross pages together, etc.). The reason they are separate is that the OCR step is the lengthiest one in the pipeline (you can see runtimes from Skills here), so I think it's better to keep the cleaning separate, so that when we want to apply small changes to the cleaning step we don't have to run the OCR again. Happy to try a different approach if you have something in mind.

Makes sense and sounds good!

> > My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a `keywords.csv`? And we could rename `count_keywords.py` --> `count.py`?
>
> So this output would have three columns, `laroplan`, `word`, and `n`, right?
Yep, that would work!
For greater storage efficiency we could also have a separate counts file for each Laroplan, a la `lgrYYYY_counts.csv`, with only two columns, `word` and `n`. That way, the `laroplan` info is encoded in the filename and doesn't need to be repeated across many rows.
What do you think?
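A minimal sketch of a `count.py` along these lines, producing one two-column CSV per Läroplan; the tokenization rule and file naming are illustrative:

```python
# count.py sketch: per-Laroplan word counts written as a two-column CSV
# (tokenization and file naming are illustrative)
import csv
import re
from collections import Counter
from pathlib import Path


def count_words(clean_txt, counts_csv):
    text = Path(clean_txt).read_text(encoding="utf-8").lower()
    # Tokenize on letter runs, keeping Swedish characters
    words = re.findall(r"[a-zåäö]+", text)
    counts = Counter(words)
    with open(counts_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "n"])
        # The laroplan is encoded in the filename, so only two columns
        writer.writerows(counts.most_common())


# e.g. count_words("derived/output/text/lgr1962_clean.txt",
#                  "analysis/output/lgr1962_counts.csv")
```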
Thanks @jmshapir!
As you probably noticed, I started to construct the pipeline in a new branch associated with this task. My plan for next steps is as follows:

- Add a `run.py`, and send to review.

Let me know what you think!
@santiagohermo thanks and sounds good!
Something to think about: in a call today with @miikapaal we thought it might be useful to have an illustration of how to use the data in the documentation. Maybe a simpler version of the analysis in the paper.
@miikapaal @santiagohermo sure, I like the idea of including a "vignette" to illustrate use of the data.
One approach could be to write an R (or python) script that:
I'd probably use keywords that are not related to the themes of our paper, but I'm sure we can find other interesting keywords to use in the vignette. (One example could be words related to technology.)
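For instance, a vignette script along these lines could read the per-Läroplan counts files and track the share of a few illustrative keywords over time. A sketch in python, where the keywords (`dator`, `teknik`, `maskin`) and paths are hypothetical placeholders:

```python
# Vignette sketch: keyword shares across Laroplaner from the
# lgrYYYY_counts.csv files (keywords and paths are placeholders)
import csv
from pathlib import Path

KEYWORDS = {"dator", "teknik", "maskin"}  # hypothetical technology-related words


def keyword_shares(counts_dir):
    """Return {year: share of tokens in KEYWORDS} for each Laroplan."""
    shares = {}
    for path in sorted(Path(counts_dir).glob("lgr*_counts.csv")):
        year = path.stem.removeprefix("lgr").removesuffix("_counts")
        with open(path, encoding="utf-8") as f:
            rows = [(r["word"], int(r["n"])) for r in csv.DictReader(f)]
        total = sum(n for _, n in rows)
        hits = sum(n for w, n in rows if w in KEYWORDS)
        shares[year] = hits / total if total else 0.0
    return shares


if __name__ == "__main__":
    for year, share in keyword_shares("analysis/output").items():
        print(year, f"{share:.6f}")
```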
Thanks for a nice suggestion!
Thanks for the quick feedback @jmshapir! Your approach aligns with what I had in mind. I added it to the list of to-dos in that previous comment.
I'm now done implementing a first version of the pipeline, as described in the second panel here. I'll thus move to a PR.
Continues in PR #2.
Summary: In this issue we discussed a structure for the repo and implemented a first version of the code.
Changes merged to `main` in https://github.com/JMSLab/LaroplanOCR/commit/712ebec830670d37c00438de0df0afba9297e480.
In this issue we will implement a first version of the code for the OCR and analysis of the Läroplaner.
As part of the implementation, we will decide on the following: