CAIDA / catalog-data

Repo which holds some panda solutions and papers

Find all objects that have "moved", re-create it, and add redirect information to their data #467

Closed bhuffaker closed 9 months ago

bhuffaker commented 2 years ago

We want to find a list of objects which have been deleted from catalog-data but have not been re-created in some other location in the catalog. Write a script, scripts/find-removed-ids.py, to do this. Example of an object with redirect information:

{
    "id": "software:dzdb_api",
    "name": "DZDB API",
    "deprecated": {
        "description": "DZDB API has been consolidated into DZDB dataset.",
        "url": "this is a test URL, will be removed",
        "id": "dataset:dzdb",
        "autoredirect": true
    },
    "visibility": "hidden"
}

VdotR commented 2 years ago

Hi Bradley, for the very first part, do I need to care about files that are deleted under recipes as they're not json objects? For example, 2 commits that deleted files were mine and that's simply because the files aren't useful anymore.

I think git log --diff-filter=D only shows us the commits in which a file was deleted? I went on Stack Overflow and found the option git log --diff-filter=D --summary, which also shows the names of the files that have been deleted. I also didn't understand "parse out the list of object is sources/(type)/(name) => id=(type):(name)". Thank you.

bhuffaker commented 2 years ago

On Aug 27, 2022, at 4:15 PM, VdotR @.***> wrote:

> Hi Bradley, for the very first part, do I need to care about files that are deleted under recipes as they're not json objects?

No.

> For example, 2 commits that deleted files were mine and that's simply because the files aren't useful anymore.

You should be comparing against master. These files were exposed publicly.

> I think git log --diff-filter=D only shows us commits where we have deleted a file? I went on stackoverflow and found this interesting option git log --diff-filter=D --summary that shows us the names of files that have been deleted?

That's fine.

> I also didn't understand "parse out the list of object is sources/(type)/(name) => id=(type):(name)". Thank you.

You can infer the object's id from the object's path on disk, i.e. sources/(type)/(name), where the id is (type):(name).
So if you have a file called sources/dataset/cats.json, the object you need to create a deprecated file for will have the id dataset:cats.

You should be able to check out the last version that contains the file and get its original name, etc.

Bradley
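
A minimal Python sketch of the path-to-id rule described above. The helper name path_to_id is hypothetical (it is not a function from the repository), and the sketch assumes every object of interest lives at sources/(type)/(name).json; paths with any other shape are skipped.

```python
from pathlib import PurePosixPath
from typing import Optional


def path_to_id(path: str) -> Optional[str]:
    """Infer an object id from a path such as sources/dataset/cats.json.

    Hypothetical helper: returns None for any path that is not a JSON
    file sitting directly under sources/(type)/.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3 or parts[0] != "sources":
        return None
    if not parts[2].endswith(".json"):
        return None
    obj_type = parts[1]
    name = parts[2][: -len(".json")]
    return f"{obj_type}:{name}"
```

For the example in the reply, path_to_id("sources/dataset/cats.json") gives the id dataset:cats.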

VdotR commented 2 years ago

Thanks Bradley. Is https://api.catalog.caida.org/ the CAIDA catalog api url?

bhuffaker commented 2 years ago

yes



VdotR commented 2 years ago

Hi Bradley, I just got the printing current ids script to work.

I notice that every python file in the scripts directory follows the same format which contains a very long header and sections such as global variables. Should I follow similar format? Is there a guide on how to contribute to scripts like what I had when writing recipes?

bhuffaker commented 2 years ago

We don't have a style guide. But all else equal, it is better for the various scripts to look similar.

VdotR commented 2 years ago

Hi Bradley, I'm now going through each of the commits that deleted files. I didn't quite understand what "Find the commit when that file was removed and see what it was" meant. Do you mean that we should find out why the file was deleted based on the commit messages? There seems to be not enough info to infer. I used git diff <HEAD commit> <commit that deleted file> --name-status

VdotR commented 2 years ago

When I look into the catalog-data repo I only see a very small number of objects. I think in a recent commit by David he moved caida datasets, but I don't know where he moved these to.

bhuffaker commented 2 years ago

They were moved to catalog-data-caida, but this doesn't matter as long as they are still in the catalog. Only ids that are no longer in the catalog, which you can find from the API, need a redirect.

"Do you mean that we should find out why the file was deleted based on the commit messages?" No, you only need to get the object's name and see if there exists something similar that it could have been mapped to. You want to go to the commit just before the object was deleted so you can see its name and description, then look in the catalog for something similar, and if they look like they are the same, create a redirect.

We will look at the example software:dzdb_api, which was consolidated into dataset:dzdb.

If you can't find any reasonable redirect make a list and post the list of ids to this issue.

VdotR commented 2 years ago

Thank you Bradley, where do I create the redirect? Under catalog-data-caida? Also, I don't quite understand what "example it" means, seems like I just need to create redirects for the files that have been deleted but are still in the API?

I am moving on to the last step (creating redirects) and I just want to make sure everything I did so far is correct.

  1. Using git log --diff-filter=D --summary I found all files that have been deleted. I then used grep with keyword ".json" to get all the objects, and then I used a few more unix commands to transform these lines to the type:name format
  2. Using find-removed-ids.py I wrote, I was able to get all objects in the catalog-data api.
  3. I sorted files I created in steps 1 and 2 in order to use the comm command. From comm -12 <deleted objects> <objects in api> I was able to get all the objects that have been deleted but are still in the api which need redirect.
  4. (The last step I'm going to do) Create deprecated objects in catalog-data-caida that redirects to current objects.
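
The comparison in step 3 can also be sketched with Python sets instead of sorted files and comm. This is only an illustration with a hypothetical function name. Note that comm -12 prints the lines common to both files (deleted ids that are still in the API), while the ids that are no longer in the catalog, which the earlier reply says are the only ones needing a redirect, are the set difference (what comm -23 would print).

```python
def split_deleted_ids(deleted_ids, catalog_ids):
    """Split deleted ids by whether the catalog API still knows them.

    Hypothetical helper. Returns (still_present, missing):
      still_present -- deleted ids that are also in the catalog (comm -12)
      missing       -- deleted ids absent from the catalog (comm -23)
    """
    deleted = set(deleted_ids)
    catalog = set(catalog_ids)
    still_present = sorted(deleted & catalog)
    missing = sorted(deleted - catalog)
    return still_present, missing
```

Using sets avoids having to sort the input files first, which comm requires.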

Sorry for all the inconvenience.

VdotR commented 2 years ago

Hi Bradley, can you give me a list of the types of objects that I need to care about (especially person since there're lots of them)? I know that I don't need to worry about recipes, but what else? Below are all folders under sources:

[image: listing of the folders under sources]

Also, I found that there are ~1300 objects (including every type) that need redirect: around 30 datasets, 500 media, 400 papers, 350 persons. Is that expected?

bhuffaker commented 2 years ago

yes ignore persons

On Sep 2, 2022, at 9:24 PM, VdotR @.***> wrote:

There are also many deleted "person" objects, do I ignore those as well?


VdotR commented 2 years ago

Thanks Bradley. I just want to confirm (so that I don't waste time doing the wrong things) that it is expected to have ~1000 removed objects (excluding recipe and person) in catalog-data? And the types I need to care about are dataset, media, paper, and software?

bhuffaker commented 2 years ago

Is that 1000 before or after you checked the ids against api.catalog.caida.org?



VdotR commented 2 years ago

After. There are ~1500 removed ids before comparing against api.catalog.caida.org

bhuffaker commented 2 years ago

can you make a list and send it to me



VdotR commented 2 years ago

I think I found the problem. For many of the old files, the id differs from the filename. Take this paper as an example: https://github.com/CAIDA/catalog-data/blob/7aeb7ad9bfe72d94a762d246f83d1a347218a444/sources/paper/2012_the_4th_workshop_on_active_internet_measurements_aims4_report.json

The JSON file is named "2012_the_4th_workshop_on_active_internet_measurements_aims4_report.json" but its id is "2012_aims4_report", which I found in api.catalog.caida.org. I think most of them fall into this case. git log --diff-filter=D only gives us the file names; is there a way to get all the removed ids instead of all the removed filenames?
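
One possible way to recover the id rather than the filename is to read the file as it existed just before the deleting commit and take its "id" field. The sketch below is an assumption-laden illustration, not the script from this issue: both function names are hypothetical, it assumes the id is stored under an "id" key, and it assumes an id without a type prefix should get the type from its path. The git show <commit>^:<path> form asks git for the file as of the parent of <commit>.

```python
import json
import subprocess


def read_deleted_file(deleting_commit: str, path: str) -> str:
    """Return a file's contents from the parent of the commit that deleted it."""
    result = subprocess.run(
        ["git", "show", f"{deleting_commit}^:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def id_from_json(json_text: str, path: str) -> str:
    """Prefer the object's own "id" field; fall back to the filename.

    Assumption: path looks like sources/(type)/(name).json, and an id
    without a "type:" prefix belongs to the type named in the path.
    """
    obj = json.loads(json_text)
    obj_type = path.split("/")[1]
    raw_id = obj.get("id")
    if not raw_id:
        filename = path.split("/")[-1]
        raw_id = filename[: -len(".json")] if filename.endswith(".json") else filename
    if ":" not in raw_id:
        raw_id = f"{obj_type}:{raw_id}"
    return raw_id
```

With the paper above, a file whose JSON contains the id 2012_aims4_report would map to paper:2012_aims4_report regardless of its long filename.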

VdotR commented 2 years ago

Solution I have in mind is that I write another python script under scripts that extracts all deleted ids, but I will need to run many unix commands under python. Is there a good way to do this? For example, how can I store the output of git log --diff-filter=D as a string?

bhuffaker commented 2 years ago

I want to do a sanity check. Please send me the list first.



VdotR commented 2 years ago

OK.

  • need_redirect.txt: all ids (filenames) that have been removed but are not in the catalog.
  • currobjs.txt: all ids in the catalog data API.
  • git_del_rec.txt: all deleted files (formatted).

bhuffaker commented 2 years ago

You should be able to compare the ‘deleted’ files to the files currently in catalog-data or catalog-data-caida.

Clone catalog-data-caida (https://github.com/CAIDA/catalog-data-caida) as a subdirectory of catalog-data (i.e. catalog-data/catalog-data-caida). Then you can write your script to search for the filename (starting in the catalog-data directory) in sources and catalog-data-caida/sources, in addition to checking for the ids in api.catalog.caida.org.
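
A rough sketch of that filename search, assuming the directory layout described above (catalog-data-caida cloned inside catalog-data). The function name and its parameters are hypothetical, chosen for illustration.

```python
from pathlib import Path


def find_existing_copies(filename, repo_root=".",
                         source_dirs=("sources", "catalog-data-caida/sources")):
    """Return every path named `filename` under the given source directories.

    Hypothetical helper: repo_root is the catalog-data checkout, and
    source_dirs are searched recursively; missing directories are skipped.
    """
    root = Path(repo_root)
    matches = []
    for source_dir in source_dirs:
        base = root / source_dir
        if base.is_dir():
            matches.extend(sorted(base.rglob(filename)))
    return matches
```

A deleted file such as dzdb.json would count as moved rather than removed if this returns a non-empty list.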

On Sep 5, 2022, at 4:16 PM, Bradley Huffaker @.***> wrote:

On Sep 5, 2022, at 4:09 PM, VdotR @.***> wrote:

> OK. need_redirect.txt https://github.com/CAIDA/catalog-data/files/9492256/need_redirect.txt This is all ids(filenames) that have been removed but not in catalog. currobjs.txt https://github.com/CAIDA/catalog-data/files/9492259/currobjs.txt This is all ids in catalog data api. git_del_rec.txt https://github.com/CAIDA/catalog-data/files/9492261/git_del_rec.txt This is all deleted files (formatted)

This file also appears to have ids: dataset:dzdb, dataset:ipv6_allpref_topology, dataset:passive_2017_pcap, dataset:passive_realtime, dataset:telescope_ddos.

Can you send me the file+path, e.g. sources/dataset/dzdb.json?


VdotR commented 2 years ago

Thanks! So it is necessary to extract the id from the file given the file name. Are there good tools to run unix commands in python and store the output as variable? For example, I would want a string that stores output of git log --diff-filter=D
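
For the question above, the standard library's subprocess.run can capture a command's output as a string. The sketch below is one way to do it; run_command and deleted_paths are hypothetical names, and the parser assumes the " delete mode <mode> <path>" lines that git log --summary prints for removed files.

```python
import subprocess


def run_command(args):
    """Run a command (given as a list of arguments) and return its stdout."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def deleted_paths(summary_text):
    """Extract paths from the ' delete mode <mode> <path>' summary lines."""
    paths = []
    for line in summary_text.splitlines():
        if line.startswith(" delete mode "):
            words = line.split(maxsplit=3)
            if len(words) == 4:
                paths.append(words[3])
    return paths
```

Inside a clone of the repository, deleted_paths(run_command(["git", "log", "--diff-filter=D", "--summary"])) would then give the list of deleted file paths as Python strings.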

bhuffaker commented 2 years ago

catalog-data-caida is a different directory. So a file moved from catalog-data to catalog-data-caida, will be deleted from catalog-data, but the file will still be on catalog-data-caida

On Sep 5, 2022, at 4:41 PM, VdotR @.***> wrote:

Thanks! But I think git log --diff-filter=D shows all removals in the ENTIRE directory?


VdotR commented 2 years ago

Hi Bradley, I rewrote the entire script. It now gets the ids removed instead of filename, and it automates all the process except for creating json files. According to my script there are 177 files that were removed but not in catalog. (excluding catalog-data-caida)

bhuffaker commented 2 years ago

send me the list of 177 filenames

VdotR commented 2 years ago

Here you go: ids_to_remove.txt

VdotR commented 2 years ago

Are the object ids case sensitive? For example, the object with id "media:1999_Crisp9912" was detected by my script as deleted but not in catalog data api. However, I found this https://catalog.caida.org/media/1999_crisp9912 in catalog data api, the only difference is that one of the "c" in the former id is capitalized.

VdotR commented 2 years ago

I find most of the ids generated fall to similar cases as above: contents are exactly the same yet the ids have different capitalization and usage of hyphen/underscore

bhuffaker commented 2 years ago

> Are the object ids case sensitive? For example, the object with id "media:1999_Crisp9912" was detected by my script as deleted but not in catalog data api. However, I found this https://catalog.caida.org/media/1999_crisp9912 in catalog data api, the only difference is that one of the "c" in the former id is capitalized.

No. Ids are not case sensitive. Use https://github.com/CAIDA/catalog-data/blob/master/scripts/lib/utils.py#L10 to convert your strings into ids. This will make everything lowercase, replace hyphens with underscores, etc.
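
As an illustration of the normalization described above, here is a minimal stand-in. It is not the function at the linked line of utils.py (whose exact behavior is not shown in this thread); normalize_id is a hypothetical name, and it only applies the two rules stated here: lowercase everything and replace hyphens with underscores.

```python
def normalize_id(raw_id: str) -> str:
    """Hypothetical stand-in: lowercase the id and turn hyphens into underscores.

    The real helper in scripts/lib/utils.py may apply further rules
    (the "etc." above); those are not reproduced here.
    """
    return raw_id.strip().lower().replace("-", "_")
```

Comparing normalize_id(deleted_id) against normalized catalog ids would treat media:1999_Crisp9912 and media:1999_crisp9912 as the same object.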

I don't expect a lot of ids to be missing.

VdotR commented 2 years ago

I finished redirects.csv for catalog-data and am now working on creating redirects.csv for catalog-data-caida. Below are two objects in catalog-data that I didn't find a good redirect for:

First entry is id, second entry is filename, third entry is last commit before removing the file.

bhuffaker commented 2 years ago

> I finished redirects.csv for catalog-data and am now working on creating redirects.csv for catalog-data-caida. Below are two objects in catalog-data that I didn't find a good redirect for:

  • [X] ['paper:2016_new_approaches_old_challenges-tr', 'sources/paper/2016_new_approaches_to_old_challenges_with_as_traceroute.json', '7aeb7ad9bfe72d94a762d246f83d1a347218a444']

Let this one go.

  • [ ] ['dataset:caida_internet_traffic', 'sources/dataset/caida_internet_traffic.json', 'ed008a6301a183f247a7593c956e2fe800f2ce86']

@eyulaeva1 do you know what this data should be redirected to?

eyulaeva1 commented 2 years ago

I looked at caida_internet_traffic.json. It has a link to paper:2019_hypersparse_neural_network_analysis and is actually passive_metadata.

bhuffaker commented 1 year ago

On Sep 5, 2022, at 4:09 PM, VdotR @.***> wrote:

> OK. need_redirect.txt https://github.com/CAIDA/catalog-data/files/9492256/need_redirect.txt This is all ids(filenames) that have been removed but not in catalog. currobjs.txt https://github.com/CAIDA/catalog-data/files/9492259/currobjs.txt This is all ids in catalog data api. git_del_rec.txt https://github.com/CAIDA/catalog-data/files/9492261/git_del_rec.txt This is all deleted files (formatted)

This file also appears to have ids: dataset:dzdb, dataset:ipv6_allpref_topology, dataset:passive_2017_pcap, dataset:passive_realtime, dataset:telescope_ddos.

Can you send me the file+path, e.g. sources/dataset/dzdb.json?


VdotR commented 1 year ago

  • sources/dataset/dzdb.json
  • sources/dataset/ipv6_allpref_topology.json
  • sources/dataset/passive_2017_pcap.json
  • sources/dataset/passive_realtime.json
  • sources/dataset/telescope_ddos.json

I can't send json files here so I'll send it via mattermost

bhuffaker commented 1 year ago

What are these files and why are they not in your branch?



VdotR commented 1 year ago

The files are here: https://github.com/CAIDA/catalog-data/pull/511/files

bhuffaker commented 1 year ago

# Stores the old_id to new_id redirects
redirect_id_id = {}
# Stores all the children of new_id
redirect_id_children = {}

# NOTE: the rest of this fragment presumably sits inside the loop that
# reads the redirects file (hence the `continue` below), where filename,
# linenum, old_id and new_id are already set.

# This is the list of nodes that will now need to be
# redirected to new_id. This checks if old_id is already
# the root of an existing tree:
# A -> B
# B -> C
# We need to redirect not only B to C, but also all the
# nodes that pointed to B
children = [old_id]
if old_id in redirect_id_children:
    children.extend(redirect_id_children[old_id])

# Check if new_id is a child of old_id
if new_id in children:
    utils.error(filename, f"[{linenum}] loop found between {old_id} and {new_id}")
    continue

# If old_id has children, forget them; they now belong to new_id
if old_id in redirect_id_children:
    del redirect_id_children[old_id]

# If new_id doesn't yet have children, add a set
if new_id not in redirect_id_children:
    redirect_id_children[new_id] = set()

# Add in all the new children
for child in children:
    redirect_id_children[new_id].add(child)

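
A self-contained sketch of how the fragment above could run as a single pass over (old_id, new_id) pairs. This is an illustration under assumptions, not the code in data-build.py: build_redirects is a hypothetical name, errors are collected in a list instead of calling utils.error, and two steps not present in the fragment are added (following new_id to its final target, and updating redirect_id_id for every child).

```python
def build_redirects(pairs):
    """Resolve redirect chains in one pass over (old_id, new_id) pairs.

    Returns (redirect_id_id, errors), where redirect_id_id maps every
    redirected id to its final target.
    """
    redirect_id_id = {}        # old_id -> final target id
    redirect_id_children = {}  # final target id -> ids redirected to it
    errors = []
    for linenum, (old_id, new_id) in enumerate(pairs, start=1):
        if old_id in redirect_id_id:
            errors.append(f"[{linenum}] {old_id} is already redirected")
            continue
        # Added step: if new_id was redirected earlier, use its final target.
        target = redirect_id_id.get(new_id, new_id)
        # old_id plus everything that already pointed at old_id
        children = {old_id} | redirect_id_children.get(old_id, set())
        if target in children:
            errors.append(f"[{linenum}] loop found between {old_id} and {new_id}")
            continue
        # old_id's children now belong to the target
        redirect_id_children.pop(old_id, None)
        redirect_id_children.setdefault(target, set()).update(children)
        # Added step: record the final target for every child
        for child in children:
            redirect_id_id[child] = target
    return redirect_id_id, errors
```

With the rows A -> B and B -> C from the comment above, both A and B end up pointing at C, in either row order, without a second read of the file.
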
jes089 commented 1 year ago

@VdotR what's your current status of this task?

VdotR commented 1 year ago

@jes089 I worked on this task last quarter and did a naive implementation which required reading the redirects csv file twice. Bradley asked me to complete the task with a single read, but I couldn't get it to work. I pushed all the code I had and am waiting for Bradley's response. Right now I'm working on other tasks.

bhuffaker commented 1 year ago

https://github.com/CAIDA/catalog-data/pull/511/files This pull request only includes JSON files, not your script changes. Please commit your changes to the branch and let us know which branch it is.

VdotR commented 1 year ago

@bhuffaker Here's the branch: https://github.com/CAIDA/catalog-data/tree/118-routeviews_prefix2as

Here's "data-build.py" in the branch: https://github.com/CAIDA/catalog-data/blob/118-routeviews_prefix2as/scripts/data-build.py

bhuffaker commented 1 year ago

I changed my mind. You don't need to merge the code. @jsun can you make sure that the code does what it is supposed to?

VdotR commented 1 year ago

Finished writing an initial version of scripts/find-removed-ids.py which converts file names to ids. However, the code produced far more ids that are not currently in the catalog than we expected. After searching for some of the ids we realized that many objects that were considered not in the catalog are actually in the catalog. For example, the paper with id "paper:2007_two_days_in_the_life_of_the_dns_anycast_root_servers", which wasn't found in the catalog, is in the catalog with the id "2007_dns_anycast".

There are two reasons that might have caused this: First we assumed that objects with name example.json will have the id "category:example", but that may not be the case for some early documents. Second the ids and names of the objects might have changed at some point.

Therefore, we decided to use the edit distance algorithm on top of comparing ids with current objects in the catalog. Basically, for deleted ids that are not in the catalog, we will calculate its edit distance with the NAME of objects in the catalog. For each missing id, if we found an object in the catalog with name similar to the missing id then we'll consider the object to be in the catalog. To improve the performance I'll edit the id (e.g. delete year, make underlines and dashes to be whitespaces) before calculating edit distance. Working on the edit distance part now.
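
A small sketch of the approach described above, assuming plain Levenshtein distance and the two clean-up steps mentioned (dropping the leading year, turning underscores and dashes into spaces). The function names and the catalog names passed in are hypothetical; the actual script may differ.

```python
import re


def clean_id(obj_id):
    """Strip the type prefix and leading year; separators become spaces."""
    name = obj_id.split(":", 1)[-1]
    name = re.sub(r"^\d{4}[_-]", "", name)
    return re.sub(r"[_-]+", " ", name).lower().strip()


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_name(obj_id, catalog_names):
    """Return the catalog name nearest to the cleaned id, or None if empty."""
    query = clean_id(obj_id)
    return min(catalog_names,
               key=lambda name: edit_distance(query, name.lower()),
               default=None)
```

A distance threshold (not chosen here) would still be needed to decide when the closest name is similar enough to count as the same object.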

VdotR commented 1 year ago

@bhuffaker

Here are the final deleted objects which I couldn't find anywhere:

  • media:2020_artemis_uknof45
  • Solutions:what-is-an-asn
  • paper:2016_new_approaches_to_old_challenges_with_as_traceroute
  • dataset:skitter_router_level_topology_measurements
  • paper:1999_experimental_study_of_internet_stability_and_backbone_failures
  • dataset:passive_equinix_nyc
  • dataset:passive_generic
  • dataset:anycast_dataset
  • dataset:passive_statistics
  • software:as_organization_api
  • dataset:ipv6_allpref_topology_dataset
  • media:2005_iinternet_applications_drivers_of_growth_20052015
  • dataset:skitter_macroscopic_topology_data
  • dataset:euro_ix_ixp_service_matrix

VdotR commented 1 year ago

@Phileodontist @eyulaeva1 Do you know anything about these datasets? These are objects that existed in the catalog-data repository before but were deleted at some point.

  • dataset:skitter_router_level_topology_measurements
  • dataset:passive_equinix_nyc
  • dataset:passive_generic
  • dataset:anycast_dataset
  • dataset:passive_statistics
  • software:as_organization_api
  • dataset:ipv6_allpref_topology_dataset
  • dataset:skitter_macroscopic_topology_data
  • dataset:euro_ix_ixp_service_matrix

Phileodontist commented 1 year ago

@VdotR Most of the datasets either changed in name or got consolidated. Most if not all of these should be in the catalog-data-caida repository.

Certain:

  • dataset:passive_statistics → dataset:passive_metadata
  • software:as_organization_api → dataset:as_organization
  • dataset:ipv6_allpref_topology_dataset → dataset:ipv6_allpref_topology
  • dataset:skitter_macroscopic_topology_data → dataset:skitter_itdk
  • dataset:skitter_router_level_topology_measurements → dataset:skitter_router_adjacencies

Uncertain:

  • dataset:passive_equinix_nyc → dataset:passive_2018_pcap & dataset:passive_2019_pcap
  • dataset:passive_generic → dataset:passive_realtime
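
For the certain mappings, a deprecated object in the shape of the software:dzdb_api example at the top of this issue could be generated with something like the sketch below. The function name, the placeholder name and the description text are hypothetical, and the sketch leaves out the "url" field, which that example marks as a test value to be removed.

```python
import json


def make_deprecated_object(old_id, old_name, new_id, description):
    """Build a hidden, auto-redirecting object for a removed id.

    Hypothetical helper mirroring the fields of the software:dzdb_api
    example (id, name, deprecated, visibility).
    """
    return {
        "id": old_id,
        "name": old_name,
        "deprecated": {
            "description": description,
            "id": new_id,
            "autoredirect": True,
        },
        "visibility": "hidden",
    }


# Placeholder name and description, for illustration only.
example = make_deprecated_object(
    "software:as_organization_api",
    "placeholder name",
    "dataset:as_organization",
    "placeholder description of the consolidation",
)
example_json = json.dumps(example, indent=4)
```

The resulting text in example_json could then be written to the matching file under sources, one file per redirected id.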

VdotR commented 1 year ago

@Phileodontist Thanks for the response. Is there anything I can do about the uncertain datasets?

Phileodontist commented 1 year ago

@bhuffaker Do you have any recollection of the following?

dataset:passive_equinix_nyc → dataset:passive_2018_pcap & dataset:passive_2019_pcap
dataset:passive_generic → dataset:passive_realtime

bhuffaker commented 12 months ago

If we can't find matching anchor files, let them go to 404.
