CAIDA / catalog-data

Repo which holds some panda solutions and papers

553 add routeviews data #639

Closed trdavidt closed 1 year ago

trdavidt commented 1 year ago

Closes #553. Adding routeviews data works with the updated Makefile and uses three scripts. scripts/externallinks_placeholder.py checks for and merges duplicate papers in the two input YAML files.
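For readers unfamiliar with the duplicate handling, here is a minimal sketch of what merging duplicate papers could look like. It is not the contents of scripts/externallinks_placeholder.py. It assumes each paper is a dict with a `name` field, that duplicates are detected by a normalized title, and that merging takes the union of keys with the first file winning on conflicts. The function names `normalize_title` and `merge_papers` are hypothetical.

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def merge_papers(primary, secondary):
    """Merge two lists of paper dicts, combining entries that share a title.

    Returns (merged_list, duplicate_count). On a key conflict the value
    from `primary` is kept; keys present only in `secondary` are added.
    """
    merged = {}
    for paper in primary:
        merged[normalize_title(paper["name"])] = dict(paper)

    duplicates = 0
    for paper in secondary:
        key = normalize_title(paper["name"])
        if key in merged:
            duplicates += 1
            # Union of keys: only fill in fields the primary entry lacks.
            for field, value in paper.items():
                merged[key].setdefault(field, value)
        else:
            merged[key] = dict(paper)
    return list(merged.values()), duplicates


# Made-up records, for illustration only.
catalog = [{"name": "Example Paper A", "tags": ["x"]}]
external = [
    {"name": "example  paper a", "url": "u", "tags": ["y"]},
    {"name": "Example Paper B"},
]
papers, found = merge_papers(catalog, external)
print(f"found {found} duplicates")
```

Keeping the primary file's value on conflict is only one possible policy; the real script may resolve conflicts differently.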

bhuffaker commented 1 year ago

I get the following error messages when I try to make it:

python3 scripts/externallinks_placeholder.py data/data-papers.yaml data/data-papers-routeviews.yaml
    loading data/data-papers.yaml
    loading data/data-papers-routeviews.yaml
    found 69 duplicates
unparseable "" in "Citadels in cyberspace"
unparseable "" in "CAIDA Macroscopic IP Topology Data Kit (ITDK) #0204 provided to the Network Modeling and Simulation (NMS) community under DARPA grant N66001-01-1-8909"
unparseable "" in "The Internet Under Crisis Conditions: Learning from September 11"
unparseable "" in "ISMA Winter 2000 Workshop - Final Report"
make[2]: *** No rule to make target `routerviews', needed by `run'.  Stop.
make[1]: *** [fast] Error 2
make: *** [readable] Error 2
trdavidt commented 1 year ago

I believe this is a typo in the Makefile. The routerviews target is no longer needed and has been removed. Should I also fix the "unparseable" errors above? They result from missing author information for these papers on routeviews.org, which I scraped to generate the routeviews YAML.

bhuffaker commented 1 year ago

Yes please.

trdavidt commented 1 year ago

It looks like there's no way to fix the remaining unparseable errors without editing the data-papers-routeviews.yaml that gets generated by scripts/external_routeviews_parse.py.

For example, the first unparseable paper has the authors "Rawat; Madhur; Chakravarty; Sambuddho" on the routeviews website. This looks like four authors, and there is no systematic way to recognize that the paper actually has two authors without searching online and manually changing the YAML. Similarly, another paper lists its author as "Unknown". However, the routeviews YAML should not be committed (#553). What should I do?
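The ambiguity can be illustrated with a small sketch. This is not the parsing logic of scripts/external_routeviews_parse.py; the helper names are hypothetical, and the "last name; first name" pairing is an assumed reading of the string, not something the data itself confirms.

```python
def split_on_semicolons(raw):
    """Split an author string on ';' and drop empty pieces."""
    return [part.strip() for part in raw.split(";") if part.strip()]


def pair_as_last_first(tokens):
    """Read tokens as alternating (last, first) names.

    Returns None when the token count is odd, since pairing is impossible.
    This is only one possible interpretation of the tokens.
    """
    if len(tokens) % 2:
        return None
    return [f"{tokens[i + 1]} {tokens[i]}" for i in range(0, len(tokens), 2)]


raw = "Rawat; Madhur; Chakravarty; Sambuddho"
tokens = split_on_semicolons(raw)

# Reading 1: four single-name authors.
print(tokens)
# Reading 2: two authors written as "last; first".
print(pair_as_last_first(tokens))
```

Both readings are syntactically valid for the same input, so a script cannot choose between them without outside information.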

bhuffaker commented 1 year ago

Make a list of papers that are broken. We will email routeviews and have them fix it on their end.

trdavidt commented 1 year ago
python3 scripts/externallinks_placeholder.py data/data-papers.yaml data/data-papers-routeviews.yaml
    loading data/data-papers.yaml
    loading data/data-papers-routeviews.yaml
    found 69 duplicates
unparseable "" in "Citadels in cyberspace"
unparseable "" in "CAIDA Macroscopic IP Topology Data Kit (ITDK) #0204 provided to the Network Modeling and Simulation (NMS) community under DARPA grant N66001-01-1-8909"
unparseable "" in "The Internet Under Crisis Conditions: Learning from September 11"
unparseable "" in "ISMA Winter 2000 Workshop - Final Report"

Of these unparseable papers, these two should be corrected if possible:
(1) "Citadels in cyberspace": badly formatted authors (see previous comment)
(2) "The Internet Under Crisis Conditions: Learning from September 11": author is "Unknown"

Both of the remaining unparseable papers have the author "CAIDA". I fixed the issues with the "CAIDA" author by fixing my script that generates the routeviews YAML. The external placeholder script is able to handle single-name authors just fine.

Happy Fourth of July!

bhuffaker commented 1 year ago

The CAIDA paper should match against papers generated from pubdb. You will need to change the Makefile so that it generates the papers from pubdb before your code is called. It will then need to check the files generated in sources/papers and not ignore duplicates.

trdavidt commented 1 year ago

What do you mean by "not ignore duplicates"? What should I do if there is a duplicate paper? Would this be similar to merging duplicates as we discussed before (taking the union of keys)?

bhuffaker commented 1 year ago

Actually, just skip those papers for now.

trdavidt commented 1 year ago

Ok, the "CAIDA" author papers should be skipped now. There should only be two unparseable papers.

(Edit: it is not showing up here, but I did push to the 553 branch.)

bhuffaker commented 1 year ago

What are those papers?

trdavidt commented 1 year ago

They are:

bhuffaker commented 1 year ago

Do you mean you don't know what to map them to?

trdavidt commented 1 year ago

I think I am misunderstanding your previous comments. This is how I interpreted our discussion:

With commit A: The two papers are included in the routeviews yaml. Then, they will be parsed correctly by the placeholder script and placeholder objects will be created for them.

With commit B: These papers are effectively ignored.

Is there something else missing that should be done for this issue?

bhuffaker commented 1 year ago

Please remove debugging error messages:

Matching up Papers with media/presentations with the same name
tag:slides
tag:slides
tag:slides
tag:slides
tag:slides

Resolve error messages:

python3 scripts/externallinks_placeholder.py data/data-papers.yaml data/data-papers-routeviews.yaml
    loading data/data-papers.yaml
    loading data/data-papers-routeviews.yaml
    found 69 duplicates
unparseable "" in "Citadels in cyberspace"
unparseable "" in "The Internet Under Crisis Conditions: Learning from September 11"