Open bhuffaker opened 1 year ago
https://www.routeviews.org/routeviews/index.php/papers/
How do I scrape the data in the table on the next page? There doesn't seem to be a url for the "next" button. (edit: resolved, all the table data was already in the html for this page but it was hidden)
All the data is already on the page; the JavaScript hides it when it presents the table.
On Mar 14, 2023, at 10:31 AM, David Tran wrote:
> https://www.routeviews.org/routeviews/index.php/papers/
> How do I scrape the data in the table on the next page? There doesn't seem to be a url for the "next" button.
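Since the reply above notes that every table row is already in the page's HTML and is only hidden by JavaScript, one pass over the downloaded HTML should be enough and no "next" URL is needed. The sketch below is a minimal illustration using only the standard library. The class and function names are hypothetical, and it assumes the papers sit in ordinary `<tr>`/`<td>` markup with explicit closing tags, which has not been verified against the actual page.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>.

    Rows hidden with CSS or JavaScript are still present in the HTML,
    so they are collected like any other row.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return every table row in html_text as a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Calling `extract_rows()` on the saved page text would return the header row followed by one list per paper, whether or not the browser currently displays that row.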
Awaiting review. I added one placeholder paper object (out of 1009 total papers) to the commit. The remaining paper placeholders should look the same. Do the fields look okay, or should there be more?
Also, there was a small issue with the data/routeviews-data.txt generated by my script: some papers have the authors' names separated by commas instead of semicolons (which most papers use). I corrected this by hand.
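The hand correction described above could probably be folded into the script. The following is a rough sketch, not the script's actual code: `normalize_authors` is a hypothetical helper, and it assumes no individual name contains a comma (an entry written as "Last, First" would be split incorrectly and would still need manual review).

```python
def normalize_authors(authors):
    """Return the author string with '; ' between names.

    Assumption: if the string contains no semicolon, commas are the
    separator and no single name contains a comma.
    """
    separator = ";" if ";" in authors else ","
    names = [name.strip() for name in authors.split(separator)]
    return "; ".join(name for name in names if name)
```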
I think we should be adding to the YAML external Paper file.
The data-papers.yaml file is manually maintained, so you should put these automated entries into a separate file. Put the YAML output into data/data-papers-routeviews.yaml. Then update the Makefile and scripts/externallinks_placeholder.py so it can read multiple input files. Make sure that it checks for duplicates. It should read the manual file first and produce a warning if the duplicate is in the automated file.
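The duplicate check described above might look roughly like this. It is a sketch under assumptions: the papers are taken to be already parsed into lists of dictionaries, the function name `merge_papers` is hypothetical, and using the `MARKER` field as the duplicate key is a guess rather than something confirmed in this thread.

```python
import sys


def merge_papers(manual, automated, key="MARKER"):
    """Merge two lists of paper dicts, giving the manual list priority.

    Returns (merged, duplicates), where duplicates lists the keys of
    automated entries that were skipped because the key already existed.
    """
    merged = {}
    for paper in manual:
        # The manual file is read first, so its entries always win.
        merged.setdefault(paper[key], paper)

    duplicates = []
    for paper in automated:
        marker = paper[key]
        if marker in merged:
            print(f"warning: duplicate {marker!r} in automated file, skipping",
                  file=sys.stderr)
            duplicates.append(marker)
            continue
        merged[marker] = paper
    return list(merged.values()), duplicates
```

Keeping the manual entry and only warning about the automated one matches the stated priority: the hand-maintained file stays authoritative.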
What is the externallinks_placeholder script used for?
Makefile:
add the URL to the top of the file
EXTERNAL_ROUTEVIEWS_URL=https://www.routeviews.org/routeviews/index.php/papers/
EXTERNAL_ROUTEVIEWS_FILE=data/data-papers-routeviews.yaml
Add scripts/externallinks_routeviews.py to the Makefile. This script downloads https://www.routeviews.org/routeviews/index.php/papers/ and converts it into the same format as data/data-papers.yaml. It should download only if the local copy of https://www.routeviews.org/routeviews/index.php/papers is older than 5 days.
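One way to implement the 5-day rule is to compare the modification time of the saved copy against the current time. This is a minimal sketch: `needs_download` is a hypothetical helper, and it assumes the age of the page means the age of the locally saved file.

```python
import os
import time

MAX_AGE_SECONDS = 5 * 24 * 60 * 60  # 5 days


def needs_download(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if path is missing or was last modified over max_age ago.

    `now` can be passed explicitly (seconds since the epoch) for testing.
    """
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > max_age
```

The script would then fetch the page only when `needs_download()` returns True for the saved HTML file.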
Then expand externallinks_placeholder so it can take multiple files as input. It will parse both data/data-papers.yaml and data/data-papers-routeviews.yaml, since they are the same format, to create sources/paper/(paper).json files for each paper in those files.
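As a rough illustration of the two changes described above, the sketch below accepts any number of input files on the command line and writes one JSON file per paper. It is not the existing externallinks_placeholder code: both function names are hypothetical, the papers are assumed to be already parsed into dictionaries, and naming each file after the paper's `MARKER` field is an assumption.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Accept one or more input files; the manual file should come first."""
    parser = argparse.ArgumentParser(
        description="Create placeholder paper objects from YAML files.")
    parser.add_argument("files", nargs="+",
                        help="input files, manual file first")
    return parser.parse_args(argv)


def write_placeholders(papers, out_dir="sources/paper"):
    """Write each paper dict to out_dir/<MARKER>.json and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for paper in papers:
        # Assumption: the MARKER value is safe to use as a file name.
        path = os.path.join(out_dir, paper["MARKER"] + ".json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(paper, handle, indent=2)
        paths.append(path)
    return paths
```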
add here: https://github.com/CAIDA/catalog-data/blob/master/Makefile#L69
> This script downloads and converts https://www.routeviews.org/routeviews/index.php/papers/ into the same format as data/data-papers.yaml.
Here is my data/data-papers-routeviews.yaml generated by this script. What should the format be for the MARKER field in the YAML? It seems like the markers in data/data-papers.yaml are formatted like YYYY_last_f_<some substring of URL>.
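If the observed YYYY_last_f_<some substring of URL> pattern turns out to be the intended format, a marker could be assembled along these lines. This is only a sketch of the guessed pattern: `make_marker` is a hypothetical helper, the first author is assumed to be written as "First Last", and which part of the URL to use is left to the caller because the thread does not specify it.

```python
import re


def make_marker(year, first_author, suffix):
    """Build a marker shaped like YYYY_last_f_<suffix>.

    Assumptions: first_author is "First Last", and suffix is whatever
    URL substring the caller has chosen.
    """
    names = first_author.split()
    last = re.sub(r"[^a-z0-9]", "", names[-1].lower())
    first_initial = names[0][0].lower()
    # Replace runs of other characters with a single underscore.
    suffix = re.sub(r"[^a-z0-9]+", "_", suffix.lower()).strip("_")
    return f"{year}_{last}_{first_initial}_{suffix}"
```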
I have merged your 554 branch into your 553 branch. Please work in this branch. I have updated the Makefile to have the routeviews target.
external:
	python3 scripts/download_url.py -O ${EXTERNAL_ROUTEVIEWS_HTML} ${EXTERNAL_ROUTEVIEWS_URL}
# Include the RouteViews papers only if the HTML file exists.
ifneq ("$(wildcard ${EXTERNAL_ROUTEVIEWS_HTML})","")
	python3 scripts/routeviews-parse.py -O ${EXTERNAL_ROUTEVIEWS_FILE} ${EXTERNAL_ROUTEVIEWS_HTML}
	python3 scripts/externallinks_placeholder.py ${EXTERNAL_MANUAL_FILE} ${EXTERNAL_ROUTEVIEWS_FILE}
else
	python3 scripts/externallinks_placeholder.py ${EXTERNAL_MANUAL_FILE}
endif
This depends on three scripts: scripts/download_url.py, scripts/routeviews-parse.py, and scripts/externallinks_placeholder.py.
notes:
Let me know if you have any questions.
Create a script that will add the papers from https://www.routeviews.org/routeviews/index.php/papers/ and link them to a Routeviews dataset.