Open angelchen7 opened 2 months ago
I wrote an initial script, compare_proposals.R, that:
1) takes a range of pages from every proposal PDF in a folder (qpdf::pdf_subset())
2) extracts the text (pdftools::pdf_text())
3) compiles the texts into a corpus (quanteda::corpus())
4) compares the texts to each other (LexisNexisTools::lnt_similarity()) using a generalized Levenshtein distance metric
5) tidies the results into a neat table to be exported locally
However, I'm still waiting on Pascale to give me push access to the Restoration repo, so I'll wait for that before reporting to the group.
Pascale told me that she no longer has admin privileges since she's not at the DSC anymore. She thinks IT must have removed her access.
For now, I opened a PR (here) and I'm hoping that at least 1 other person on the team can merge it 😅
All my work is in compare_proposals.R, and I've emailed everyone in Restoration a lengthy update explaining what I did and how the workflow works.
Summary
The Delta Restoration group has received almost all of their proposals, but they noticed that some of them may be duplicates. For example, 4 proposals were submitted for the same spatial area and are essentially the same project. To make sure they're not double-counted, we need a way to compare the proposals, measure how similar they are, and pick out the duplicates. Think about how close these texts are: how many words do they share?
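One simple way to put a number on "how many words do they share" is Jaccard similarity on the sets of words in two texts. The sketch below is only an illustration in base R. The helper names word_set() and jaccard() are hypothetical and are not part of any package or of the group's workflow.

```r
# Split a text into its set of unique lowercase words.
word_set <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  unique(words[nzchar(words)])
}

# Jaccard similarity: shared words divided by all distinct words in either text.
# Returns NaN if both texts contain no words at all.
jaccard <- function(a, b) {
  wa <- word_set(a)
  wb <- word_set(b)
  length(intersect(wa, wb)) / length(union(wa, wb))
}

# Two made-up snippets that share 3 of their 5 distinct words.
score <- jaccard("restore tidal marsh habitat", "restore tidal marsh channels")
print(score)
```

A word-overlap score like this ignores word order, so it answers a slightly different question than an edit distance does. The two could be used side by side when deciding which proposals to review as duplicates.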
Starting Tasks
Useful links