Closed bhuffaker closed 9 months ago
Hi Bradley, for the very first part, do I need to care about files that were deleted under recipes, since they're not JSON objects? For example, two of the commits that deleted files were mine, and that's simply because the files weren't useful anymore.
I think git log --diff-filter=D only shows commits in which a file was deleted? On Stack Overflow I found the option git log --diff-filter=D --summary, which also shows the names of the deleted files.
I also didn't understand "parse out the list of object is soureces/(type)/(name) => id=(type):(name)". Thank you.
On Aug 27, 2022, at 4:15 PM, VdotR @.***> wrote:
> Hi Bradley, for the very first part, do I need to care about files that are deleted under recipes as they're not json objects?
no
> For example, 2 commits that deleted files were mine and that's simply because the files aren't useful anymore.
You should be comparing against master. These files were exposed publicly.
> I think git log --diff-filter=D only shows us commits where we have deleted a file? I went on stackoverflow and found this interesting option git log --diff-filter=D --summary that shows us the names of files that have been deleted?
That's fine
> I also didn't understand "parse out the list of object is soureces/(type)/(name) => id=(type):(name)" . Thank you.
You can infer the object's id from the object's path on disk, i.e. sources/(type)/(name), where the id is (type):(name).
So if you have a file called sources/dataset/cats.json, the object you need to create a deprecated file for will have the id dataset:cats.
You should be able to check out the last version that has the file and get its original name, etc.
Bradley
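The path-to-id rule described above can be sketched in Python as follows. This is only an illustration: the helper name path_to_id is hypothetical (it is not part of the catalog-data scripts), and it assumes the file name really matches the object's id, which may not hold for every file.

```python
from pathlib import PurePosixPath


def path_to_id(path):
    """Map sources/(type)/(name).json to the id (type):(name).

    Returns None for paths that do not have exactly that shape
    (for example files nested more deeply, or non-JSON files).
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3 or parts[0] != "sources" or not parts[2].endswith(".json"):
        return None
    name = parts[2][: -len(".json")]
    return f"{parts[1]}:{name}"
```

For example, path_to_id("sources/dataset/cats.json") gives dataset:cats, matching the example in the reply.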
Thanks Bradley. Is https://api.catalog.caida.org/ the CAIDA catalog api url?
yes
On Aug 29, 2022, at 3:24 PM, VdotR @.***> wrote:
Thanks Bradley. Is https://api.catalog.caida.org/ https://api.catalog.caida.org/ the CAIDA catalog api url?
— Reply to this email directly, view it on GitHub https://github.com/CAIDA/catalog-data/issues/467#issuecomment-1230935079, or unsubscribe https://github.com/notifications/unsubscribe-auth/AECPT7PE4JE32EDSLPJGXWDV3U2BTANCNFSM57XTJPWQ. You are receiving this because you authored the thread.
Hi Bradley, I just got the printing current ids script to work.
I notice that every Python file in the scripts directory follows the same format, with a very long header and sections such as global variables. Should I follow a similar format? Is there a guide on how to contribute scripts, like the one I had when writing recipes?
We don't have a style guide. But all else equal, it is better for the various scripts to look similar.
Hi Bradley, I'm now going through each of the commits that deleted files. I didn't quite understand what "Find the commit when that file was removed and see what it was" meant. Do you mean that we should find out why the file was deleted based on the commit messages? There seems to be not enough info to infer. I used git diff <HEAD commit> <commit that deleted file> --name-status
When I look into the catalog-data repo I only see a very small number of objects. I think in a recent commit by David he moved caida datasets, but I don't know where he moved these to.
They were moved to catalog-data-caida, but this doesn't matter as long as they are still in the catalog. Only ids that are no longer in the catalog, which you can find from the API, need a redirect.
" Do you mean that we should find out why the file was deleted based on the commit messages?" No, you only need to get the object's name and see whether something similar exists that it could have been mapped to. You want to go to the commit just before the object was deleted so you can see its name and description, then look in the catalog for something similar, and if they look like the same object, create a redirect.
We will look at the example software:dzdb_api, which was consolidated into dataset:dzdb.
If you can't find any reasonable redirect make a list and post the list of ids to this issue.
Thank you Bradley, where do I create the redirect? Under catalog-data-caida? Also, I don't quite understand what "example it" means, seems like I just need to create redirects for the files that have been deleted but are still in the API?
I am moving on to the last step (creating redirects) and I just want to make sure everything I did so far is correct.
Using git log --diff-filter=D --summary, I found all the files that have been deleted. I then used grep with the keyword ".json" to get all the objects, and a few more unix commands to transform these lines to the type:name format.
Using comm -12 <deleted objects> <objects in api>, I was able to get all the objects that have been deleted but are still in the api, which need redirects.
Sorry for all the inconvenience.
Hi Bradley, can you give me a list of the types of objects that I need to care about (especially person since there're lots of them)? I know that I don't need to worry about recipes, but what else? Below are all folders under sources:
Also, I found that there are ~1300 objects (including every type) that need redirect: around 30 datasets, 500 media, 400 papers, 350 persons. Is that expected?
yes ignore persons
On Sep 2, 2022, at 9:24 PM, VdotR @.***> wrote:
There are also many deleted "person" objects, do I ignore those as well?
Thanks Bradley. I just want to confirm (so that I don't waste time doing the wrong things) that it is expected to have ~1000 removed objects (excluding recipe and person) in catalog-data? And the types I need to care about are dataset, media, paper, and software?
Is that 1000 before or after you checked the ids against api.catalog.caida.org?
On Sep 5, 2022, at 3:46 PM, VdotR @.***> wrote:
Thanks Bradley. I just want to confirm (so that I don't waste time doing the wrong things) that it is expected to have ~1000 removed objects (excluding recipe and person) in catalog-data? And the types I need to care about are dataset, media, paper, and software?
After. There are ~1500 removed ids before comparing against api.catalog.caida.org
can you make a list and send it to me
On Sep 5, 2022, at 3:51 PM, VdotR @.***> wrote:
After. There are ~1500 removed ids before comparing against api.catalog.caida.org
I think I found the problem. For many of the old files, the id differs from the filename. Take this paper as an example: https://github.com/CAIDA/catalog-data/blob/7aeb7ad9bfe72d94a762d246f83d1a347218a444/sources/paper/2012_the_4th_workshop_on_active_internet_measurements_aims4_report.json
The JSON file is named "2012_the_4th_workshop_on_active_internet_measurements_aims4_report.json" but the id is "2012_aims4_report", which I found in api.catalog.caida.org. I think most of them fall into this case. git log --diff-filter=D will only give us the file names; is there a way to get all the removed ids instead of all the removed filenames?
The solution I have in mind is to write another Python script under scripts that extracts all the deleted ids, but I will need to run many unix commands from Python. Is there a good way to do this? For example, how can I store the output of git log --diff-filter=D as a string?
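One common way to capture a command's output as a string is Python's standard subprocess module. The sketch below is a minimal illustration, not the script from this issue: the function names are hypothetical, and the parser assumes the usual " delete mode <mode> <path>" summary lines that git prints with --summary.

```python
import subprocess


def git_deleted_log(repo_dir="."):
    """Return the output of `git log --diff-filter=D --summary` as one string."""
    result = subprocess.run(
        ["git", "log", "--diff-filter=D", "--summary"],
        cwd=repo_dir,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if git fails
    )
    return result.stdout


def deleted_paths(log_text):
    """Extract the paths from ' delete mode <mode> <path>' summary lines."""
    paths = []
    for line in log_text.splitlines():
        fields = line.strip().split(maxsplit=3)
        if len(fields) == 4 and fields[:2] == ["delete", "mode"]:
            paths.append(fields[3])
    return paths
```

Calling deleted_paths(git_deleted_log()) from inside a clone would then give the deleted file paths as a Python list, which avoids chaining grep and other unix tools.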
I want to do a sanity check. Please send me the list first.
On Sep 5, 2022, at 4:05 PM, VdotR @.***> wrote:
Solution I have in mind is that I write another python script under scripts that extracts all deleted ids, but I will need to run many unix commands under python. Is there a good way to do this? For example, how can I store the output of git log --diff-filter=D as a string?
OK. need_redirect.txt This is all ids(filenames) that have been removed but not in catalog. currobjs.txt This is all ids in catalog data api. git_del_rec.txt This is all deleted files (formatted)
You should be able to compare the ‘deleted’ files to the files currently in catalog-data or catalog-data-caida.
Clone catalog-data-caida (https://github.com/CAIDA/catalog-data-caida) as a subdirectory of catalog-data (i.e. catalog-data/catalog-data-caida). Then you can write your script to search for the filename (starting in the catalog-data directory) in sources and catalog-data-caida/sources, in addition to checking for the ids in api.catalog.caida.org.
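A rough sketch of that filename search, assuming it is run from the catalog-data directory with catalog-data-caida cloned inside it. The function name find_in_sources and its roots parameter are hypothetical.

```python
from pathlib import Path


def find_in_sources(deleted_path, roots=("sources", "catalog-data-caida/sources")):
    """Return existing files under the given roots that share the deleted file's name."""
    name = Path(deleted_path).name
    matches = []
    for root in roots:
        root_path = Path(root)
        if root_path.is_dir():
            # rglob searches the root and all of its subdirectories
            matches.extend(sorted(root_path.rglob(name)))
    return matches
```

An empty result means the file name no longer exists in either repository, so that id would still need to be checked against api.catalog.caida.org.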
On Sep 5, 2022, at 4:16 PM, Bradley Huffaker @.***> wrote:
On Sep 5, 2022, at 4:09 PM, VdotR @.***> wrote:
> OK. need_redirect.txt https://github.com/CAIDA/catalog-data/files/9492256/need_redirect.txt This is all ids(filenames) that have been removed but not in catalog. currobjs.txt https://github.com/CAIDA/catalog-data/files/9492259/currobjs.txt This is all ids in catalog data api. git_del_rec.txt https://github.com/CAIDA/catalog-data/files/9492261/git_del_rec.txt This is all deleted files (formatted)
This file also appears to have ids: dataset:dzdb dataset:ipv6_allpref_topology dataset:passive_2017_pcap dataset:passive_realtime dataset:telescope_ddos
Can you send me the file+path sources/dataset/dzdb.json
Thanks! So it is necessary to extract the id from the file itself, given the file name. Are there good tools for running unix commands in Python and storing the output in a variable? For example, I would want a string that stores the output of git log --diff-filter=D.
catalog-data-caida is a different directory. So a file moved from catalog-data to catalog-data-caida, will be deleted from catalog-data, but the file will still be on catalog-data-caida
On Sep 5, 2022, at 4:41 PM, VdotR @.***> wrote:
Thanks! But I think git log --diff-filter=D shows all removals in the ENTIRE directory?
Hi Bradley, I rewrote the entire script. It now gets the ids removed instead of filename, and it automates all the process except for creating json files. According to my script there are 177 files that were removed but not in catalog. (excluding catalog-data-caida)
send me the list of 177 filenames
Here you go: ids_to_remove.txt
Are the object ids case sensitive? For example, the object with id "media:1999_Crisp9912" was detected by my script as deleted but not in catalog data api. However, I found this https://catalog.caida.org/media/1999_crisp9912 in catalog data api, the only difference is that one of the "c" in the former id is capitalized.
I find most of the ids generated fall to similar cases as above: contents are exactly the same yet the ids have different capitalization and usage of hyphen/underscore
No. Ids are not case sensitive. Use https://github.com/CAIDA/catalog-data/blob/master/scripts/lib/utils.py#L10 to convert your strings into ids. This will make everything lowercase, replace hyphens with underscores, etc.
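To illustrate the kind of normalization described above, here is a hypothetical stand-in. It is not the function in scripts/lib/utils.py, whose exact rules may differ; this sketch assumes ids have the form type:name and that only lowercasing and replacing non-alphanumeric runs with underscores is needed.

```python
import re


def normalize_id(raw_id):
    """Lowercase an id and replace runs of non-alphanumeric characters in the name with '_'.

    Hypothetical helper for illustration; the real conversion lives in utils.py.
    """
    type_, _, name = raw_id.partition(":")
    name = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{type_.lower()}:{name}"
```

Comparing normalized ids on both sides (deleted ids and ids from the API) would treat media:1999_Crisp9912 and media:1999_crisp9912 as the same object.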
I don't expect a lot of ids to be missing.
I finished redirects.csv for catalog-data and am now working on creating redirects.csv for catalog-data-caida. Below are two objects in catalog-data that I didn't find a good redirect for:
First entry is id, second entry is filename, third entry is last commit before removing the file.
[x] ['paper:2016_new_approaches_old_challenges-tr', 'sources/paper/2016_new_approaches_to_old_challenges_with_as_traceroute.json', '7aeb7ad9bfe72d94a762d246f83d1a347218a444']
[ ] ['dataset:caida_internet_traffic', 'sources/dataset/caida_internet_traffic.json', 'ed008a6301a183f247a7593c956e2fe800f2ce86']
> I finished redirects.csv for catalog-data and is now working on creating redirects.csv for catalog-data-caida. Below are two objects in catalog-data that I didn't find a good redirect of:
> - [X] ['paper:2016_new_approaches_old_challenges-tr', 'sources/paper/2016_new_approaches_to_old_challenges_with_as_traceroute.json', '7aeb7ad9bfe72d94a762d246f83d1a347218a444']
Let this one go.
> - [ ] ['dataset:caida_internet_traffic', 'sources/dataset/caida_internet_traffic.json', 'ed008a6301a183f247a7593c956e2fe800f2ce86']
@eyulaeva1 do you know what this data should be redirected to?
I looked at caida_internet_traffic.json. It has a link to paper:2019_hypersparse_neural_network_analysis and is actually passive_metadata.
sources/dataset/dzdb.json sources/dataset/ipv6_allpref_topology.json sources/dataset/passive_2017_pcap.json sources/dataset/passive_realtime.json sources/dataset/telescope_ddos.json
I can't send json files here so I'll send it via mattermost
What are these files, and why are they not in your branch?
On Oct 11, 2022, at 2:55 PM, VdotR @.***> wrote:
sources/dataset/dzdb.json sources/dataset/ipv6_allpref_topology.json sources/dataset/passive_2017_pcap.json sources/dataset/passive_realtime.json sources/dataset/telescope_ddos.json
I can't send json files here so I'll send it via mattermost
The files are here: https://github.com/CAIDA/catalog-data/pull/511/files
# Stores the old_id to new_id redirects
redirect_id_id = {}
# Stores all the children of new_id
redirect_id_children = {}
# (The code below presumably runs inside the loop over the redirect rows,
# where old_id, new_id, filename, and linenum are set for each row.)
# This is the list of nodes that will now need to be
# redirected to new_id. This checks if old_id is already
# the root of an existing tree
#    A -> B
#    B -> C
# We need to redirect not only B to C, but also all the
# nodes that pointed to B
children = [old_id]
if old_id in redirect_id_children:
    children.extend(redirect_id_children[old_id])
# Check if new_id is a child of old_id
if new_id in children:
    utils.error(filename, f"[{linenum}] loop found between {old_id} and {new_id}")
    continue
# If old_id has children, forget them; they now belong to new_id
if old_id in redirect_id_children:
    del redirect_id_children[old_id]
# If new_id doesn't yet have children, add a set
if new_id not in redirect_id_children:
    redirect_id_children[new_id] = set()
# Add in all the new children
for child in children:
    redirect_id_children[new_id].add(child)
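For reference, the same single-pass idea can be written as a self-contained function. This is a sketch under assumptions, not the code in data-build.py: the function name build_redirects is hypothetical, errors are collected in a list instead of calling utils.error, and the step that follows new_id to its current target is an addition that the fragment above does not show.

```python
def build_redirects(pairs):
    """Resolve (old_id, new_id) pairs in one pass.

    Returns (redirect, errors), where redirect maps every old id to its
    final target and errors lists the rows that would create a loop.
    """
    redirect = {}   # old_id -> final target id
    children = {}   # target id -> set of ids that redirect to it
    errors = []
    for linenum, (old_id, new_id) in enumerate(pairs, start=1):
        # If new_id itself already redirects somewhere, use its final target.
        target = redirect.get(new_id, new_id)
        # old_id and everything that pointed at old_id must move to target.
        moved = {old_id} | children.get(old_id, set())
        if target in moved:
            errors.append(f"[{linenum}] loop found between {old_id} and {new_id}")
            continue
        children.pop(old_id, None)
        for child in moved:
            redirect[child] = target
        children.setdefault(target, set()).update(moved)
    return redirect, errors
```

With the rows A -> B and B -> C, in either order, both A and B end up redirected to C after a single read of the input.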
@VdotR what's your current status of this task?
@jes089 I worked on this task last quarter and did a naive implementation that required reading the redirects CSV file twice. Bradley asked me to complete the task with only one read, but I couldn't get it to work. I pushed all the code I had and am waiting for Bradley's response. Right now I'm working on other tasks.
https://github.com/CAIDA/catalog-data/pull/511/files this pull request only includes JSON files. Not your script changes. Please commit your changes to the branch and let us know which branch it is.
@bhuffaker Here's the branch: https://github.com/CAIDA/catalog-data/tree/118-routeviews_prefix2as
Here's "data-build.py" in the branch: https://github.com/CAIDA/catalog-data/blob/118-routeviews_prefix2as/scripts/data-build.py
I changed my mind. You don't need to merge the code. @jsun can you make sure that the code does what it is supposed to?
Finished writing an initial version of scripts/find-removed-ids.py, which converts file names to ids. However, the code produced far more ids that are not currently in the catalog than we expected. After searching for some of the ids, we realized that many objects that were reported as missing are actually in the catalog. For example, the paper with id "paper:2007_two_days_in_the_life_of_the_dns_anycast_root_servers", which wasn't found in the catalog, is in the catalog with the id "2007_dns_anycast".
There are two reasons that might have caused this: First we assumed that objects with name example.json will have the id "category:example", but that may not be the case for some early documents. Second the ids and names of the objects might have changed at some point.
Therefore, we decided to use an edit distance algorithm in addition to comparing ids with the current objects in the catalog. For each deleted id that is not in the catalog, we calculate its edit distance to the NAME of each object in the catalog. If we find an object in the catalog whose name is similar to the missing id, we consider the object to be in the catalog. To improve the matching, I'll clean up the id (e.g. delete the year and replace underscores and dashes with spaces) before calculating the edit distance. Working on the edit distance part now.
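A minimal sketch of that matching step, using the standard Levenshtein distance. The helper names and the exact cleanup rules are assumptions for illustration; the actual script may clean ids differently or apply a similarity threshold.

```python
import re


def clean_name(text):
    """Lowercase, drop a leading 4-digit year, and turn underscores/dashes into spaces."""
    text = re.sub(r"^\d{4}[_\- ]+", "", text.lower())
    return re.sub(r"[_\-\s]+", " ", text).strip()


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_name(missing_id, catalog_names):
    """Return the catalog name with the smallest edit distance to the cleaned id."""
    target = clean_name(missing_id.partition(":")[2])  # drop the "type:" prefix
    return min(catalog_names, key=lambda n: edit_distance(target, clean_name(n)))
```

In practice the smallest distance would also need to be compared against some cutoff, since closest_name always returns a candidate even when nothing in the catalog is really similar.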
@bhuffaker
Here are the final deleted objects which I couldn't find anywhere:
media:2020_artemis_uknof45
Solutions:what-is-an-asn
paper:2016_new_approaches_to_old_challenges_with_as_traceroute
dataset:skitter_router_level_topology_measurements
paper:1999_experimental_study_of_internet_stability_and_backbone_failures
dataset:passive_equinix_nyc
dataset:passive_generic
dataset:anycast_dataset
dataset:passive_statistics
software:as_organization_api
dataset:ipv6_allpref_topology_dataset
media:2005_iinternet_applications_drivers_of_growth_20052015
dataset:skitter_macroscopic_topology_data
dataset:euro_ix_ixp_service_matrix
@Phileodontist @eyulaeva1 Do you know anything about these datasets? These are objects that existed in the catalog-data repository before but were deleted at some point.
dataset:skitter_router_level_topology_measurements
dataset:passive_equinix_nyc
dataset:passive_generic
dataset:anycast_dataset
dataset:passive_statistics
software:as_organization_api
dataset:ipv6_allpref_topology_dataset
dataset:skitter_macroscopic_topology_data
dataset:euro_ix_ixp_service_matrix
@VdotR Most of the datasets either changed in name or got consolidated. Most if not all of these should be in the catalog-data-caida repository.
Certain:
dataset:passive_statistics → dataset:passive_metadata
software:as_organization_api → dataset:as_organization
dataset:ipv6_allpref_topology_dataset → dataset:ipv6_allpref_topology
dataset:skitter_macroscopic_topology_data → dataset:skitter_itdk
dataset:skitter_router_level_topology_measurements → dataset:skitter_router_adjacencies
Uncertain:
dataset:passive_equinix_nyc → dataset:passive_2018_pcap & dataset:passive_2019_pcap
dataset:passive_generic → dataset:passive_realtime
@Phileodontist Thanks for the response. So is there anything I can do about the uncertain datasets?
@bhuffaker Do you have any recollection of the following?
dataset:passive_equinix_nyc → dataset:passive_2018_pcap & dataset:passive_2019_pcap
dataset:passive_generic → dataset:passive_realtime
If we can’t find matching anchor files, let them go to 404.
On Sep 21, 2023, at 12:15 PM, Philip Leo Pascual @.***> wrote:
@bhuffaker https://github.com/bhuffaker You have any recollection of the follow?
dataset:passive_equinix_nyc → dataset:passive_2018_pcap & dataset:passive_2019_pcap
dataset:passive_generic → dataset:passive_realtime
We want to find a list of objects that have been deleted from catalog-data but have not been recreated at some other location in the catalog. Write a script, scripts/find-removed-ids.py.
First, we need to get a list of all the deleted files, find those that were objects, and save this list of ids to a file:
git log --diff-filter=D
Then we need to get a list of the current ids in the catalog (pseudocode below).
Then print out the set of ids that were removed but are currently not in the catalog.
Find the commit in which that file was removed and see what it was. See if you can find a matching object in the catalog to redirect to. Start with that file and create a matching object with visibility hidden and a deprecated object set: