This is an attempt to extract data from the old PCLOB site, scraped via HTTrack, and then populate the new site with it.
Given our current timeline and funding, this seems like the most expedient option we have at the moment. In the future, the extracted data can be modified to fit a better information architecture for PCLOB's needs.
In particular, this PR does the following:
The Events and press page now contains a merge of the legacy site's Newsroom and Meetings & Events sections. The page links out to detail pages for each news or event item, and where possible the permalinks for these detail pages have been preserved from the legacy site. The individual files for these collections live in `_newsroom` and `_events`.
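In Jekyll terms, each of those collection items is just a file with YAML front matter, and a preserved URL can be pinned with a `permalink` key. The filename and values below are hypothetical, only to illustrate the pattern, not an actual file from this PR:

```yaml
---
# Hypothetical entry, e.g. _newsroom/2016-02-08-public-meeting.md;
# real field names and values come from the extractor.
title: "Public Meeting of the Board"
date: 2016-02-08
# Legacy permalink pinned so existing links to the old site keep resolving.
permalink: /newsroom/pressreleases/2016-02-08-public-meeting.html
---
```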
The Semiannual reports, Official correspondence, and Federal Register notices pages are now populated.
The file `_data/legacy-library.yaml` contains structured data representing the contents of the legacy Library page. Some of it is used to generate the aforementioned report pages; the rest is currently unused (but we can put it to work later if we want). Eventually we might want to move individual parts of this data into separate files so it's easier for the PCLOB folks to maintain.
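To give a rough idea of what "structured data" means here, the shape might look something like the following. The keys, titles, and paths are hypothetical; the real structure is whatever the extractor emits:

```yaml
# Hypothetical shape only; see _data/legacy-library.yaml for the real keys.
semiannual_reports:
  - title: "Semi-Annual Report to Congress"
    date: 2016-03-01
    file: /library/semi-annual-report-2016.pdf
federal_register_notices:
  - title: "Notice of Public Meeting"
    date: 2015-07-01
    file: /library/fr-notice-public-meeting-2015.pdf
```

Jekyll exposes anything in `_data/` through `site.data`, which is how the report pages can pull their lists at build time.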
I had some trouble extracting data properly for FOIA reports, so I'll populate those manually in #40.
The `legacy-site` directory contains the legacy site's HTML files and the JavaScript extractor program used to generate the aforementioned data and collections. We probably shouldn't hand-edit the generated files for a little while, until we're convinced the extractor is doing its job properly; that way we can modify and re-run it without clobbering anyone's edits. Once the output is stable, though, we should delete the directory. Instructions for running the extractor are in `legacy-site/README.md`.
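For the curious, the gist of the approach (not the actual extractor, whose selectors and output live in `legacy-site/`) is to load each scraped HTML page and turn the links it finds into YAML. A minimal sketch, assuming `cheerio` and `js-yaml` and a hypothetical input path:

```js
const fs = require('fs');
const cheerio = require('cheerio'); // assumed dependency
const yaml = require('js-yaml');    // assumed dependency

// Hypothetical input: one of the HTTrack-scraped pages.
const html = fs.readFileSync('legacy-site/library.html', 'utf8');
const $ = cheerio.load(html);

// Collect every PDF link on the page as { title, url } pairs.
const items = [];
$('a[href$=".pdf"]').each((i, el) => {
  items.push({
    title: $(el).text().trim(),
    url: $(el).attr('href'),
  });
});

// Write the result where Jekyll can pick it up via site.data.
fs.writeFileSync('_data/legacy-library.yaml', yaml.dump({ reports: items }));
```

The real extractor does more than this (permalink preservation, collection files with front matter, and so on), which is why re-running it per `legacy-site/README.md` beats hand-editing its output for now.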
This PR also moves `/assets/files/` to `/library/`. While it does muddy up our repo a bit, I'm doing it primarily to ensure that any existing permalinks to PCLOB's PDF files aren't broken.