JMSLab / LaroplanOCR

Swedish primary school curricula (Läroplaner för grundskolan) in digital format.
MIT License

Define structure and implement first version of code #1

Closed · santiagohermo closed this 2 years ago

santiagohermo commented 2 years ago

In this issue we will implement a first version of the code for the OCR and analysis of the Läroplaner.

As part of the implementation, we will decide on the following:

santiagohermo commented 2 years ago

I opened this issue to start work here @jmshapir @miikapaal.

I think that the final version should look like a set of python scripts that go from PDF versions to sentence and paragraph level datasets in csv format. Do you see any other important decisions we need to make in light of this goal?

Also, could I have write access @jmshapir? I'll assign myself here and start working on a new branch soon.

jmshapir commented 2 years ago

> I opened this issue to start work here @jmshapir @miikapaal.

Thanks @santiagohermo!

> I think that the final version should look like a set of python scripts that go from PDF versions to sentence and paragraph level datasets in csv format.

And I guess also counts at the level of the Laroplan-word (i.e., for each Laroplan, a count of each word that appears in it)?

> Also, could I have write access @jmshapir? I'll assign myself here and start working on a new branch soon.

Done!

santiagohermo commented 2 years ago

I wanted to flag that I had some problems related to git-lfs when cloning the repo in GitHub Desktop @jmshapir. The problem was related to a specific file, so I tried deleting it from origin/main (commit https://github.com/JMSLab/LaroplanOCR/commit/2e5061dd867d695de9c8bb853c7bdd83e6b6a8ae) and then cloning again. This didn't solve the issue: the same error happened on another file transferred via lfs. The log for this second error is 20220120T153846.1084247.log. My guess is that files in JMSLab/Template are not added to lfs, and so they cannot be downloaded when cloning. I wonder whether this is related to https://github.com/JMSLab/Template/issues/43

UPDATE: I managed to clone the repo using the procedure in https://github.com/git-lfs/git-lfs/issues/911#issuecomment-169998792, but this didn't help with getting the binaries. I will delete them anyway, so I think it's ok: new binaries will be committed from scratch and should work fine.

For completeness, I note that when trying to get the binaries following the procedure in https://github.com/git-lfs/git-lfs/issues/911#issuecomment-169998792 I get the following message:

[screenshot of git-lfs error message]

santiagohermo commented 2 years ago

Note: I created branch issue1_firstversion

santiagohermo commented 2 years ago

On the structure @jmshapir @miikapaal @dagese.

In Skills, the structure of the laroplan analysis is as follows:

    .
    ├── drive/
    │   ├── raw_large/Laroplan/orig/    # Raw files
    │   └── output/                     # Stores output
    ├── source/
    │   ├── derived_large/laroplan/
    │   │   └── make_images.py          # Transforms PDF to one PNG per page
    │   ├── derived/laroplan/
    │   │   ├── ocr.py                  # OCRs each PNG separately and outputs TXT
    │   │   ├── run_ocr.py              # Runs `make_images.py` and `ocr.py`
    │   │   └── prepare.py              # Makes paragraph and sentence level datasets (run via SCons)
    │   └── analysis/laroplan/
    │       ├── keywords.csv            # Contains fluid and crystallized keywords
    │       ├── bad_hits.csv            # Contains cases of usage to be dropped
    │       ├── analyze.py              # Runs main analysis
    │       ├── translate.py            # Translates results
    │       └── ...                     # Plotting scripts
    └── output/                         # Stores output
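
For reference, here is a minimal sketch of the two heavy steps in that pipeline. The library choices (pdf2image for rasterization, pytesseract for the OCR) and the Swedish language code are assumptions for illustration, not necessarily what Skills uses:

    from pathlib import Path

    import pytesseract
    from pdf2image import convert_from_path

    def make_images(pdf_path: Path, out_dir: Path) -> list[Path]:
        """Rasterize a PDF into one PNG per page (what make_images.py does in spirit)."""
        out_dir.mkdir(parents=True, exist_ok=True)
        pages = convert_from_path(str(pdf_path), dpi=300)
        png_paths = []
        for i, page in enumerate(pages, start=1):
            png = out_dir / f"{pdf_path.stem}_p{i:03d}.png"
            page.save(png, "PNG")
            png_paths.append(png)
        return png_paths

    def ocr_images(png_paths: list[Path], txt_path: Path) -> None:
        """OCR each page image and concatenate the text (what ocr.py does in spirit)."""
        # lang="swe" assumes the Swedish traineddata for Tesseract is installed.
        chunks = [pytesseract.image_to_string(str(p), lang="swe") for p in png_paths]
        txt_path.write_text("\n\n".join(chunks), encoding="utf-8")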

Assuming we don't use a datastore, I propose we organize this new repo as follows:

    .
    ├── raw_data/
    │   ├── docs/                  # Documentation of files
    │   └── orig/                  # Original pdf files
    ├── derived/
    │   ├── code/
    │   │   ├── make_images.py     # Similar to version in Skills
    │   │   ├── ocr.py             # Similar to version in Skills
    │   │   └── clean.py           # Portion of prepare.py that cleans text file
    │   └── output/
    │       ├── images/
    │       │   └── .gitignore     # Ignores png files
    │       └── text/
    └── analysis/
        ├── code/
        │   ├── make_data.py       # Portion of prepare.py that makes csv datasets
        │   ├── analyze.py         # Main analysis of fluid vs. crystallized
        │   └── translate.py       # Optional, translate output
        ├── output/
        │   └── ...
        ├── keywords.csv
        └── bad_hits.csv

If we use a datastore, maybe we can keep both the pdfs and the images there, along with a script to make images out of the pdfs and some instructions on how to run it.

What do you think @jmshapir @miikapaal @dagese?


On how to run the scripts. I think SCons is a bit too complicated for a simple public repo. I see several alternatives:

  1. A run.py script that runs everything, with a structure similar to the runall.do scripts in Skills (see the sketch after this list)
  2. Separate bash files for Windows and Mac
  3. GNU make, a simpler version of SCons in which you declare target and source files. I used it in my Amazon internship; it requires installation plus a single Makefile at the root of the repo.
  4. Both 1 and 3.
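
For concreteness, a minimal sketch of option 1; the step list just mirrors the proposed tree above, and the exact script names are placeholders:

    import subprocess
    import sys

    # Pipeline scripts in execution order, following the proposed tree.
    STEPS = [
        "derived/code/make_images.py",
        "derived/code/ocr.py",
        "derived/code/clean.py",
        "analysis/code/make_data.py",
        "analysis/code/analyze.py",
    ]

    def main() -> None:
        for step in STEPS:
            print(f"Running {step} ...")
            subprocess.run([sys.executable, step], check=True)  # stop on first failure

    if __name__ == "__main__":
        main()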

Any other ideas?

jmshapir commented 2 years ago

@santiagohermo thanks!

Regarding the git-lfs issues, it doesn't surprise me that when I spawned the new repo from the Template, git-lfs-tracked files were omitted, since these are treated differently from git-tracked files. (For example, git-lfs files are, I think, omitted from the archive automatically created with each release.)

But, if this is preventing us from cloning new repositories that are based on the Template, that definitely seems like a bug we want to squash. Can you open an issue for yourself in the Template repo to figure out how we should approach this? Thanks!

(I will look next at the repository structure questions but there may be some delay before I can post again.)

santiagohermo commented 2 years ago

Thanks @jmshapir! I opened https://github.com/JMSLab/Template/issues/45 for the cloning bug; we can discuss the issue there.

On the structure questions, that sounds good!

jmshapir commented 2 years ago

@santiagohermo thanks!

On the structure:

santiagohermo commented 2 years ago

Thanks @jmshapir! I went through your comments, including the ones in the pdf. In response, I revised the proposed structure for the repo, which you can see at the bottom of this comment.

Some thoughts on the pdf:

On your remarks in the comment: I agree with avoiding a datastore and with having a run.py, so I included both. And thanks for the reference to congress-legislators!

I'll start implementing and we can always adjust if you have more comments or we change our minds about something.


Suggested repository structure

LaroplanOCR/
    ├── run.py                            # Runs code
    ├── readme.md                         # Description of repo
    ├── raw/
    │   ├── docs/                         # Documentation of files
    │   ├── orig/                         # Original pdf files
    │   └── readme.md                     # Instructions to get raw files
    ├── derived/
    │   ├── code/
    │   │   ├── make_images.py            # Similar to version in Skills
    │   │   ├── ocr.py                    # Similar to version in Skills
    │   │   └── clean.py                  # Portion of prepare.py that cleans text file
    │   └── output/
    │       ├── images/
    │       │   └── .gitignore            # Ignores png files
    │       └── text/
    │           ├── lgrYYYY.txt           # Output of ocr.py, one file per lgr
    │           └── lgrYYYY_clean.txt     # Output of clean.py, one file per lgr
    └── analysis/
        ├── code/
        │   ├── make_data.py              # Portion of prepare.py that makes csv datasets
        │   ├── count.py                  # Counts all word appearances
        │   └── translate.py              # Optional, translate output
        └── output/
            ├── lgrYYYY_paragraphs.csv    # Output of make_data.py, one file per lgr
            ├── lgrYYYY_sentences.csv     # Output of make_data.py, one file per lgr
            └── lgrYYYY_counts.csv        # Output of count.py, one file per lgr
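
For concreteness, a sketch of what make_data.py could do for one lgr. The sentence splitter below is a naive placeholder assumption; a real implementation might use a Swedish-aware tokenizer:

    import csv
    import re
    from pathlib import Path

    def make_data(clean_txt: Path, out_dir: Path) -> None:
        """Split lgrYYYY_clean.txt into paragraph- and sentence-level CSVs."""
        stem = clean_txt.stem.replace("_clean", "")  # e.g. "lgr1962"
        text = clean_txt.read_text(encoding="utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

        with open(out_dir / f"{stem}_paragraphs.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["paragraph_id", "text"])
            for i, par in enumerate(paragraphs, start=1):
                writer.writerow([i, par])

        with open(out_dir / f"{stem}_sentences.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["paragraph_id", "sentence_id", "text"])
            for i, par in enumerate(paragraphs, start=1):
                # Naive split on sentence-final punctuation; placeholder heuristic.
                for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", par), start=1):
                    if sent:
                        writer.writerow([i, j, sent])
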
jmshapir commented 2 years ago

Thanks @santiagohermo! The updated tree looks great.

A couple questions:

  1. Can you explain the main differences between lgrYYYY.txt and lgrYYYY_clean.txt?
  2. My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a keywords.csv? And we could rename count_keywords.py --> count.py?

santiagohermo commented 2 years ago

Thanks @jmshapir! Some replies to your questions.

> 1. Can you explain the main differences between lgrYYYY.txt and lgrYYYY_clean.txt?

Yes. The files lgrYYYY.txt are the raw output of the OCR. The files lgrYYYY_clean.txt are the output of the OCR after some cleaning (dropping page numbering, joining paragraphs that cross pages, etc.). The reason they are separate is that the OCR step is the lengthiest in the pipeline (you can see runtimes from Skills here), so I think it's better to keep the cleaning separate: when we want to apply small changes to the cleaning step, we don't have to run the OCR again.

Happy to try a different approach if you have something in mind.
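
To make the cleaning step concrete, a toy sketch of the kind of transformations clean.py could apply; the regexes are illustrative assumptions, not the final rules:

    import re

    def clean_ocr_text(raw_text: str) -> str:
        """Drop page numbers and rejoin text that page breaks split apart."""
        # Drop lines that contain only a page number.
        lines = [ln for ln in raw_text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
        text = "\n".join(lines)
        # Rejoin words hyphenated across line breaks, e.g. "under-\nvisning" -> "undervisning".
        text = re.sub(r"-\n(?=\w)", "", text)
        # Collapse single newlines inside paragraphs; keep blank lines as paragraph breaks.
        text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
        return text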

> 2. My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a keywords.csv? And we could rename count_keywords.py --> count.py?

So this output would have three columns, laroplan, word, and n, right? Sounds much easier than what we do in Skills, so I like it! I updated the tree in https://github.com/JMSLab/LaroplanOCR/issues/1#issuecomment-1018890028 directly.

jmshapir commented 2 years ago

Thanks @santiagohermo!

> 1. Can you explain the main differences between lgrYYYY.txt and lgrYYYY_clean.txt?

> Yes. The files lgrYYYY.txt are the raw output of the OCR. The files lgrYYYY_clean.txt are the output of the OCR after some cleaning (dropping page numbering, joining paragraphs that cross pages, etc.). The reason they are separate is that the OCR step is the lengthiest in the pipeline (you can see runtimes from Skills here), so I think it's better to keep the cleaning separate: when we want to apply small changes to the cleaning step, we don't have to run the OCR again.

> Happy to try a different approach if you have something in mind.

Makes sense and sounds good!

> 2. My thought is that for each LGR we can just precompute counts of each word that appears. Therefore I'm not sure we need a keywords.csv? And we could rename count_keywords.py --> count.py?

> So this output would have three columns, laroplan, word, and n, right?

Yep, that would work!

For greater storage efficiency we could also have a separate counts file for each Laroplan, a la

lgrYYYY_counts.csv

with only two columns, word and n.

That way, the laroplan info is encoded in the filename and doesn't need to be repeated across many rows.
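
A minimal sketch of the per-Laroplan counts step under that design; the tokenizer is a placeholder assumption:

    import csv
    import re
    from collections import Counter
    from pathlib import Path

    def count_words(clean_txt: Path, out_csv: Path) -> None:
        """Write lgrYYYY_counts.csv with two columns: word and n."""
        text = clean_txt.read_text(encoding="utf-8").lower()
        # \w+ keeps Swedish letters (å, ä, ö) since Python's re is Unicode-aware.
        counts = Counter(re.findall(r"\w+", text))
        with open(out_csv, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["word", "n"])
            for word, n in counts.most_common():
                writer.writerow([word, n])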

What do you think?

santiagohermo commented 2 years ago

Thanks @jmshapir!


As you probably noticed, I started constructing the pipeline in a new branch associated with this task. My plan for next steps is as follows:

Let me know what you think!

jmshapir commented 2 years ago

@santiagohermo thanks and sounds good!

santiagohermo commented 2 years ago

Something to think about: in a call today with @miikapaal, we thought it might be useful to have an illustration of how to use the data in the documentation. Maybe a simpler version of the analysis in the paper.

jmshapir commented 2 years ago

@miikapaal @santiagohermo sure, I like the idea of including a "vignette" to illustrate use of the data.

One approach could be to write an R (or Python) script that:

I'd probably avoid keywords related to the themes of our paper, but I'm sure we can find other interesting ones to use in the vignette. (One example could be words related to technology.)
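
As a rough illustration, a vignette along these lines could be as short as the following Python sketch; the keyword list and the output layout are assumptions:

    import csv
    from pathlib import Path

    # Hypothetical technology-related keywords for the vignette.
    KEYWORDS = {"teknik", "teknisk", "dator", "maskin"}

    def keyword_share(counts_csv: Path) -> float:
        """Share of all word occurrences in one lgr that match the keywords."""
        total = hits = 0
        with open(counts_csv, encoding="utf-8") as f:
            for row in csv.DictReader(f):
                n = int(row["n"])
                total += n
                if row["word"] in KEYWORDS:
                    hits += n
        return hits / total if total else 0.0

    # Compare keyword prevalence across curricula.
    for path in sorted(Path("analysis/output").glob("lgr*_counts.csv")):
        print(path.stem, f"{keyword_share(path):.4%}")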

Thanks for the nice suggestion!

santiagohermo commented 2 years ago

Thanks for the quick feedback @jmshapir! Your approach aligns with what I had in mind. I added it to the list of to-dos in that previous comment.

I'm now done implementing a first version of the pipeline, as described in the second panel here. I'll thus move to a PR.

santiagohermo commented 2 years ago

Continues in PR #2

santiagohermo commented 2 years ago

Summary: In this issue we discussed a structure for the repo and implemented a first version of code.

Changes merged to main in https://github.com/JMSLab/LaroplanOCR/commit/712ebec830670d37c00438de0df0afba9297e480