CAIDA / catalog-data

Repo which holds some panda solutions and papers

Related objects shouldn't be on the list of Access links unless they're tightly related #565

Closed amacaida closed 1 year ago

amacaida commented 1 year ago

https://catalog.caida.org/paper/2022_scalable_network_event_detection_poster

In PubDB the related catalog objects are software:corsaro media:2021_flowtuples_iv_dust paper:2014_passive_ip_space_usage_estimation paper:2012_analysis_slash_zero paper:2017_millions_targets_under_attack paper:2021_spatial_temporal_analysis paper:2004_tr_2004_04

It seems that media:2021_flowtuples_iv_dust appears under the "Links to access" listing when it shouldn't, as it's only loosely related to the paper. The paper mentions the slide deck in a footnote, but they're not tightly related.

image

There are times when we want related presentations to show up on the list of papers, usually if they're identically named or if the PubDB entry explicitly has a link called "Related Presentation".

bhuffaker commented 1 year ago

scripts/pubdb_placeholder.py needs a way to tell which linked PDFs are slides for the paper and which are just linked. Are you suggesting a new parameter?

  "links": [
    {
      "from": "PubDBlinkId:1842",
      "label": "DOI",
      "to": "https://doi.org/10.1145/3517745.3563015"
    },
    {
      "from": "PubDBlinkId:1841",
      "label": "PDF",
      "to": "https://www.caida.org/catalog/papers/2022_scalable_network_event_detection_poster/scalable_network_event_detection_poster.pdf"
    }
  ],

amacaida commented 1 year ago

If there's a label "PDF" in "links" and it's a paper entry, it is the PDF of the paper. If it's a slideset entry, it is the PDF of the slideset. I also use a label called "Related Paper", which on a slideset entry points to the Catalog entry for the paper, and "Related Slideset", which on a paper entry points to the slideset.

So not a new parameter.

My issue is that, for the example above, there are only two links (DOI and PDF), but Catalog has decided that the flowtuples slideset is special enough to be in the Links to Access, when really it should only appear as a Related Object.

image

I'd want this entry to not have the flowtuple slideset in "Links to Access"

bhuffaker commented 1 year ago

@amacaida how will I identify the first set as slides for the dataset and not the second from the pubdb dump?

amacaida commented 1 year ago

For the paper we're talking about, PDF should be in the Access link list because it's linked with the label PDF.
The second access link (2021_flowtuples_iv_dust) should not be in the Access list because it's only a Related Object. As an aside, none of the papers in the Related Objects wrongly ended up in the Links to Access list, but for some reason the flowtuples presentation did.

image

amacaida commented 1 year ago

In this case, again, PDF should be in the access list since it's explicitly listed with the label PDF. This related catalog object, paper:2022_mind_your_manrs, actually should be in the access list. How could we guess that? 🤔 The IDs are similar, i.e. 2022_mind_your_manrs_imc vs. the linked 2022_mind_your_manrs of the paper. Can there be some way to look for a partial match like that? Or, the titles are identical between the presentation and the paper. They're both "Mind Your MANRS: Measuring the MANRS Ecosystem".

Is there a way to say "if related catalog object's id matches mostly or if the related catalog object's title is identical to this object's title, then add to access list"?
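
As a rough illustration of the rule asked about above, here is a minimal sketch. It is not code from scripts/pubdb_placeholder.py: the function names, the "id" key, the "media:" prefix on the presentation ID, and the flowtuples title string are all assumptions made for the example. "ID matches mostly" is interpreted narrowly as "one ID equals the other plus an underscore suffix".

```python
def bare_id(catalog_id):
    """Drop a leading type prefix such as 'paper:' or 'media:'."""
    return catalog_id.split(":", 1)[-1]


def is_tightly_related(obj, related):
    """Hypothetical rule: identical titles, or one ID extends the other."""
    same_title = obj["name"].strip().lower() == related["name"].strip().lower()
    short_id, long_id = sorted((bare_id(obj["id"]), bare_id(related["id"])), key=len)
    # '2022_mind_your_manrs_imc' extends '2022_mind_your_manrs' with '_imc'.
    id_extends = long_id == short_id or long_id.startswith(short_id + "_")
    return same_title or id_extends


# IDs and the MANRS title come from the thread; the 'media:' prefix on the
# presentation and the flowtuples title are placeholders for illustration.
manrs_slides = {"id": "media:2022_mind_your_manrs_imc",
                "name": "Mind Your MANRS: Measuring the MANRS Ecosystem"}
manrs_paper = {"id": "paper:2022_mind_your_manrs",
               "name": "Mind Your MANRS: Measuring the MANRS Ecosystem"}
poster = {"id": "paper:2022_scalable_network_event_detection_poster",
          "name": "A Scalable Network Event Detection Framework for Darknet Traffic"}
flowtuples = {"id": "media:2021_flowtuples_iv_dust",
              "name": "placeholder flowtuples slideset title"}

manrs_result = is_tightly_related(manrs_slides, manrs_paper)
flowtuples_result = is_tightly_related(poster, flowtuples)
```

Under these assumptions the MANRS pair is accepted and the flowtuples slideset is rejected. A prefix test like this would also accept any short ID that happens to be a prefix of a longer one, so it would need checking against the full PubDB dump.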

image

amacaida commented 1 year ago

In this straightforward last case, there are explicit links labeled "Related Presentation" and "Data Supplement" in the PubDB links, so they also belong on the Links to Access list:

image

bhuffaker commented 1 year ago

So everything in Links, except DOI, is added to access. Nothing in linkedObjects.

"name": "A Scalable Network Event Detection Framework for Darknet Traffic"
   "DOI": "https://doi.org/10.1145/3517745.3563015"
   "PDF": "https://www.caida.org/catalog/papers/2022_scalable_network_event_detection_poster/scalable_network_event_detection_poster.pdf"

"name": "Mind Your MANRS: Measuring the MANRS Ecosystem"
   "DOI": "https://doi.org/10.1145/3517745.3561419"
   "PDF": "https://www.caida.org/catalog/papers/2022_mind_your_manrs/mind_your_manrs.pdf"

"name": "DynamIPs: Analyzing address assignment practices in IPv4 and IPv6"
   "PDF": "https://www.caida.org/catalog/papers/2020_dynamips/dynamips.pdf"
   "DOI": "https://doi.org/10.1145/3386367.3431314"
   "Data Supplement": "https://www.caida.org/catalog/papers/2020_dynamips/supplemental"
   "Related Presentation": "https://catalog.caida.org/media/2020_dynamips_conext"

"name": "A Scalable Network Event Detection Framework for Darknet Traffic"
   "DOI": "https://doi.org/10.1145/3517745.3563015"
   "PDF": "https://www.caida.org/catalog/papers/2022_scalable_network_event_detection_poster/scalable_network_event_detection_poster.pdf"
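
The rule at the top of this comment (every entry in "links" except DOI goes to access, nothing from linkedObjects) could be sketched as below. This is an illustrative sketch only, not the code in the commit: the function name `access_links` and the shape of the returned list are assumptions; the "label" and "to" keys are taken from the PubDB dump excerpt earlier in the thread.

```python
def access_links(links, skip_labels=("DOI",)):
    """Return the links that belong in the access list.

    Hypothetical helper: keeps every link whose label is not in skip_labels
    and ignores linked objects entirely.
    """
    return [
        {"label": link["label"], "to": link["to"]}
        for link in links
        if link["label"] not in skip_labels
    ]


# The two links from the dump excerpt for the poster paper.
poster_links = [
    {"from": "PubDBlinkId:1842", "label": "DOI",
     "to": "https://doi.org/10.1145/3517745.3563015"},
    {"from": "PubDBlinkId:1841", "label": "PDF",
     "to": "https://www.caida.org/catalog/papers/2022_scalable_network_event_detection_poster/scalable_network_event_detection_poster.pdf"},
]

poster_access = access_links(poster_links)
```

With this input, only the PDF link is kept, so the flowtuples slideset never reaches the access list.
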
bhuffaker commented 1 year ago

b568c34bd40892f0d6985e577f7b91dc96aa7d5c

amacaida commented 1 year ago

> So everything in Links, except DOI, is added to access. Nothing in linkedObjects.

Incorrect; linked objects should be added only if the linked object's ID or title (or whatever) closely matches the current object's ID or title, per my second screenshot with the yellow checkmark (the Mind Your MANRS example).

bhuffaker commented 1 year ago

>> So everything in Links, except DOI, is added to access. Nothing in linkedObjects.

> Incorrect; linked objects only if the linked object's ID or title (or whatever) matches closely to the current object's ID or title, per my second screenshot with the yellow checkmark (the Mind Your MANRS example)

Where would you put the cut? Would you use a percentage or the number of edits? https://github.com/CAIDA/catalog-data/blob/master/analysis/title-edit-distance.md

amacaida commented 1 year ago

Looks like I'd first try a 100% match (0% ratio?), since even 1% off can give the wrong thing. Better for it to miss an actual related link than to wrongly pair the entry with a close-but-not-quite link.
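
To make the two options for the cut concrete (number of edits versus a percentage), here is a minimal sketch of both measures with the strict threshold proposed above. The ratio definition used here (edit count divided by the longer title's length) is an assumption and may differ from the one in analysis/title-edit-distance.md; all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_ratio(a, b):
    """Edit distance as a fraction of the longer string (assumed definition)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def titles_match(a, b, max_ratio=0.0):
    """With max_ratio=0.0 this is the '100% match' cut: no edits allowed."""
    return edit_ratio(a, b) <= max_ratio


manrs = "Mind Your MANRS: Measuring the MANRS Ecosystem"
```

With `max_ratio=0.0`, `titles_match(manrs, manrs)` holds and any title that differs by even one character is rejected, which trades missed pairings for no wrong pairings.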

bhuffaker commented 1 year ago

Ok. @amacaida do you have an REU with PubDB access who can find any false negatives? For example, "One-way Traffic Monitoring with iatmon" and "One way Traffic Monitoring with iatmon".
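
One possible way to keep the exact-match rule while catching a false negative like the iatmon pair is to normalize titles before comparing them. The sketch below is an assumption, not the approach taken in the repository: `normalize_title` and `same_title` are hypothetical names, and the normalization only handles ASCII letters and digits, so titles with accented characters would need a broader rule.

```python
import re


def normalize_title(title):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def same_title(a, b):
    """Exact match after normalization (hypothetical helper)."""
    return normalize_title(a) == normalize_title(b)


with_hyphen = "One-way Traffic Monitoring with iatmon"
without_hyphen = "One way Traffic Monitoring with iatmon"

raw_equal = with_hyphen == without_hyphen             # strict comparison misses the pair
normalized_equal = same_title(with_hyphen, without_hyphen)
```

Here the raw comparison fails because of the hyphen, while the normalized comparison treats the two titles as the same.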

bhuffaker commented 1 year ago

Attempt 2 to resolve this, with this pull request: https://github.com/CAIDA/catalog-data/pull/626