CAIDA / catalog-data

Repo which holds some panda solutions and papers

add routeviews data #553

Open bhuffaker opened 1 year ago

bhuffaker commented 1 year ago

Create a script that will add the papers from https://www.routeviews.org/routeviews/index.php/papers/ and link them to the Routeviews datasets.

trdavidt commented 1 year ago

https://www.routeviews.org/routeviews/index.php/papers/

How do I scrape the data in the table on the next page? There doesn't seem to be a URL for the "next" button. (Edit: resolved. All the table data was already in the HTML for this page, but it was hidden.)

bhuffaker commented 1 year ago

All the data is already on the page; the JavaScript hides the extra rows when it presents the table.
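Because the hidden rows are still in the markup, a plain HTML parser sees them without any pagination logic. The following is a minimal sketch using only the standard library; the sample HTML and the names `TableRowParser` and `parse_rows` are made up for illustration and are not taken from the actual page or from the repo's scripts.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="display: none" are ignored on purpose:
        # hidden rows are collected like any other row.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return every table row found in the HTML, visible or not."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample: the last row is hidden with CSS but still in the markup.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Authors</th></tr>
  <tr><td>Paper A</td><td>Doe, J.</td></tr>
  <tr style="display: none"><td>Paper B</td><td>Roe, R.</td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
```

Here `rows` contains all three rows, including the one a browser would not display.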

trdavidt commented 1 year ago

Awaiting review. The commit includes one placeholder paper object out of 1009 total papers; the remaining placeholders should look the same. Do the fields look okay, or should there be more?

trdavidt commented 1 year ago

Also, there was a small issue with the data/routeviews-data.txt generated by my script: some papers have the authors' names separated by commas instead of semicolons (which most papers use). I corrected this by hand.
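The hand correction could probably be automated. This is a rough sketch, assuming each name is written without an internal comma (for example "First Last" rather than "Last, First"); the function names are hypothetical and not part of the existing script.

```python
def split_authors(raw):
    """Split an author string on ';' if present, otherwise on ','.

    Assumption: individual names never contain a comma themselves.
    """
    sep = ";" if ";" in raw else ","
    return [name.strip() for name in raw.split(sep) if name.strip()]


def normalize_authors(raw):
    """Return the author string with '; ' as the only separator."""
    return "; ".join(split_authors(raw))


comma_form = normalize_authors("Alice Smith, Bob Jones")
semicolon_form = normalize_authors("Alice Smith;Bob Jones")
```

Both calls yield "Alice Smith; Bob Jones". Entries in "Last, First" order would be split incorrectly by this heuristic and would still need a manual check.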

bhuffaker commented 1 year ago

I think we should be adding to the YAML external Paper file.

The data-papers.yaml file is manually maintained, so you should put the automated output into a separate file. Put the YAML output into data/data-papers-routeviews.yaml. Then update the Makefile and scripts/externallinks_placeholder.py so it can read multiple input files. Make sure that it checks for duplicates. It should read the manual file first, and produce a warning if a duplicate is found in the automated file.
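One way the duplicate check could work is sketched below. It operates on already-parsed lists of dictionaries so it stays self-contained; using MARKER as the duplicate key and the name `merge_papers` are assumptions for illustration, not what scripts/externallinks_placeholder.py actually does.

```python
import sys


def merge_papers(manual, automated, key="MARKER"):
    """Merge two lists of paper dicts, letting manual entries win.

    Assumption: `key` names a field that uniquely identifies a paper.
    Returns (merged_list, warnings).
    """
    merged = {}
    warnings = []

    # Manual entries are read first, so they always take precedence.
    for paper in manual:
        merged.setdefault(paper[key], paper)

    # Automated entries are added only if the key has not been seen yet.
    for paper in automated:
        if paper[key] in merged:
            warnings.append(
                f"warning: duplicate {paper[key]!r} in automated file, skipped"
            )
            continue
        merged[paper[key]] = paper

    return list(merged.values()), warnings


manual = [{"MARKER": "a", "source": "manual"}]
automated = [
    {"MARKER": "a", "source": "auto"},
    {"MARKER": "b", "source": "auto"},
]

papers, warnings = merge_papers(manual, automated)
for line in warnings:
    print(line, file=sys.stderr)
```

In this example the automated copy of "a" is dropped with a warning, and "b" is added after the manual entry.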

trdavidt commented 1 year ago

What is the externallinks_placeholder script used for?

bhuffaker commented 1 year ago

Makefile:

add here: https://github.com/CAIDA/catalog-data/blob/master/Makefile#L69

trdavidt commented 1 year ago

This script downloads and converts https://www.routeviews.org/routeviews/index.php/papers/ into the same format as data/data-papers.yaml.

Here is my data/data-papers-routeviews.yaml generated by this script. What should the format be for the MARKER field in the YAML? It seems like the markers in /data/data-papers.yaml are formatted like YYYY_last_f_<some substring of URL>.
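If that reading of the marker pattern is right, a marker could be assembled roughly as follows. This is only a sketch of the observed pattern: the lowercase conversion, the function name `make_marker`, and the way the URL-derived suffix is chosen are all assumptions, so the suffix is passed in rather than computed.

```python
def make_marker(year, last_name, first_name, suffix):
    """Build a marker of the form YYYY_last_f_<suffix>.

    Assumptions: names are lowercased, only the first initial is used,
    and the caller supplies the URL-derived suffix.
    """
    parts = [str(year), last_name.lower(), first_name[:1].lower(), suffix]
    return "_".join(parts)


# Made-up values, not a real entry from data/data-papers.yaml.
marker = make_marker(2023, "Tran", "David", "example")
```

With these made-up inputs the result is "2023_tran_d_example".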

bhuffaker commented 1 year ago

I have merged your 554 branch into your 553 branch. Please work in this branch. I have updated the Makefile to have the routeviews target.

external:
    python3 scripts/download_url.py -O ${EXTERNAL_ROUTEVIEWS_HTML} ${EXTERNAL_ROUTEVIEWS_URL}
ifneq ("$(wildcard ${EXTERNAL_ROUTEVIEWS_HTML})","")
    python3 scripts/routeviews-parse.py -O ${EXTERNAL_ROUTEVIEWS_FILE} ${EXTERNAL_ROUTEVIEWS_HTML}
    python3 scripts/externallinks_placeholder.py ${EXTERNAL_MANUAL_FILE} ${EXTERNAL_ROUTEVIEWS_FILE}
else
    python3 scripts/externallinks_placeholder.py ${EXTERNAL_MANUAL_FILE}
endif

This depends on three scripts: scripts/download_url.py, scripts/routeviews-parse.py, and scripts/externallinks_placeholder.py.

notes:

Let me know if you have any questions.