Finding a way to compare and pick out similar pdfs

angelchen7 commented 2 months ago

Summary

The Delta Restoration group has received almost all of their proposals, but they noticed that some of the proposals may be duplicates. For example, 4 proposals were submitted for the same spatial area, and they are essentially the same project. To make sure they're not double-counted, we need a way to compare the proposals, measure how similar they are, and pick out the duplicates. Think about how close these texts are: how many words do they share?
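As a rough illustration of the "how many words do they share?" idea, here is a minimal sketch of a word-overlap (Jaccard) score. It is written in Python purely for illustration and is not part of any script in this issue. The function names and the proposal snippets are hypothetical, and the tokenization (lowercased runs of letters, digits, and apostrophes) is an assumption.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(text_a, text_b):
    """Fraction of distinct words shared by the two texts (0 = none, 1 = all)."""
    words_a, words_b = word_set(text_a), word_set(text_b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical proposal snippets, invented for illustration only
proposal_a = "Restore tidal marsh habitat on the island"
proposal_b = "Restore tidal wetland habitat on the island"
proposal_c = "Install fish screens at the pumping station"

print(jaccard_similarity(proposal_a, proposal_b))  # near-duplicates score high
print(jaccard_similarity(proposal_a, proposal_c))  # unrelated texts score low
```

A score like this ignores word order, so it would probably work best as a quick first screen before a closer comparison.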

Starting Tasks

[x] Explore and see if there are any useful R packages for comparing similar texts
[x] Write a script to showcase an example of comparing some pdfs
[x] Open PR
[x] Report to the group
[ ] Revise as needed

Useful links

angelchen7 commented 2 months ago

Waiting for push access...

I wrote an initial script, compare_proposals.R, that:

1) takes a range of pages from every proposal pdf in a folder (qpdf::pdf_subset())
2) extracts the text (pdftools::pdf_text())
3) compiles the texts into a corpus (quanteda::corpus())
4) compares the texts to each other (LexisNexisTools::lnt_similarity()) by using a generalized Levenshtein distance metric
5) tidies the results into a neat table to be exported locally
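To show the idea behind the comparison step, here is a minimal sketch of a plain Levenshtein distance applied to every pair of texts. It is written in Python for illustration and is not the contents of compare_proposals.R. The file names and texts are hypothetical, and dividing by the longer text's length is an assumption made to keep scores comparable, not necessarily how lnt_similarity() scales its output.

```python
from itertools import combinations


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def compare_all(texts):
    """Return (name_a, name_b, distance) rows for every pair, closest first."""
    rows = [(name_a, name_b, round(relative_distance(text_a, text_b), 3))
            for (name_a, text_a), (name_b, text_b) in combinations(texts.items(), 2)]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical extracted texts, invented for illustration only
texts = {
    "proposal_1.pdf": "Restore tidal marsh habitat",
    "proposal_2.pdf": "Restore tidal marsh habitats",
    "proposal_3.pdf": "Install fish screens",
}

for row in compare_all(texts):
    print(row)
```

Pairs with a distance near 0 are the likely duplicates. What cutoff counts as "the same project" is a judgment call for the group.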

However, I'm still waiting on Pascale to give me push access to the Restoration repo, so I'll hold off on reporting to the group until I have it.

angelchen7 commented 2 months ago

Can't get push access!

Pascale told me that she no longer has admin privileges since she's not at the DSC anymore. She thinks IT must have removed her access.

For now, I opened a PR (here) and I'm hoping that at least 1 other person on the team can merge it 😅

All my work is in compare_proposals.R, and I've sent everyone in Restoration a detailed email explaining what I did and how the workflow works.

NCEAS / learning-hub-organization