Generate RDF files with project info

julianharty commented 8 months ago

Context

We'd like to be able to save the project-related data we obtain so it can be combined by us and by others with the parent data of the NLnet projects. NLnet uses RDF data formats for the parent data and can accept and combine related RDF data if we provide it.

Task

Once the RDF structures have been defined (which will probably be on https://codeberg.org/NLnet/importscripts), generate query results exported using these structures.

julianharty commented 8 months ago

RDF file formats are described at https://jena.apache.org/documentation/io/

julianharty commented 7 months ago

Some useful sources of information about processing RDF include:

tnzmnjm commented 7 months ago

Looked at the NLnet files (https://codeberg.org/NLnet/importscripts) and realised the RDF format used is Turtle (.ttl).
Wrote a script to convert a row to .ttl format.
The input (first row) is:
- projectref 2016-12-023
- nlnetpage https://nlnet.nl/project/stubby
- repourl https://github.com/getdnsapi/stubby
- testfilecount 2
The output is:
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ns1: <https://nlnet.nl/project/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ns1:2016-12-023 a foaf:Project ;
        rdfs:seeAlso <https://github.com/getdnsapi/stubby> ;
        foaf:homepage ns1:stubby ;
        ns1:testFileCount 2 .
We will be able to amend the script based on NLnet's requirements in the future and apply it to the whole dataset.
The script can be found under 'src/rdf_export/export_to_rdf.py'.
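For readers without access to the script, here is a minimal, dependency-free sketch of the row-to-Turtle conversion described above. It is not the code in 'src/rdf_export/export_to_rdf.py' (which may well use an RDF library); the function name `row_to_turtle` is hypothetical, and writing the subject and objects as full IRIs rather than `ns1:` prefixed names is an assumption made to keep the output valid for page names that contain characters such as a trailing "/".

```python
# Sketch only: renders one row as Turtle text using plain string formatting.
# row_to_turtle is a hypothetical name, not the function in the real script.

FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
NLNET = "https://nlnet.nl/project/"


def row_to_turtle(row: dict) -> str:
    """Return a Turtle document describing a single project row."""
    # Assumption: the project reference is appended to the NLnet project base IRI.
    subject = f"<{NLNET}{row['projectref']}>"
    lines = [
        f"@prefix foaf: <{FOAF}> .",
        f"@prefix ns1: <{NLNET}> .",
        f"@prefix rdfs: <{RDFS}> .",
        "",
        f"{subject} a foaf:Project ;",
        f"    rdfs:seeAlso <{row['repourl']}> ;",
        f"    foaf:homepage <{row['nlnetpage']}> ;",
        # A bare integer is Turtle shorthand for an xsd:integer literal.
        f"    ns1:testFileCount {int(row['testfilecount'])} .",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    first_row = {
        "projectref": "2016-12-023",
        "nlnetpage": "https://nlnet.nl/project/stubby",
        "repourl": "https://github.com/getdnsapi/stubby",
        "testfilecount": 2,
    }
    print(row_to_turtle(first_row))
```

String formatting like this does no escaping, so URLs containing characters that are not allowed inside an IRI would need validating first; an RDF library handles that for you.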

tnzmnjm commented 7 months ago

Pushed the current code to the branch - save the repos on a USB drive, and create a column for the last commit hash
Analysing the produced df
Filter out rows where the URL doesn't have a repository name (83 rows) before cloning the repos
Address the repourls that start with http rather than https
Realised that the repourls with a "/" at the end of the repo name have a testfilecount value but no commit hash value: 2017-10-006a,https://nlnet.nl/project/vita,https://github.com/inters/vita/,176,nan
Will need to remove the '/' at the end of the repo name
Analysing the rows where the testcount is 0:

----- repourl: https://github.com/siacs/Conversations --> when I opened it, the GitHub URL is now https://github.com/iNPUTmice/Conversations --> the README says they have moved the code: "A New Home. We've moved. Conversations is now developed on Codeberg." Moved to Codeberg: https://github.com/librestack/librecast

----- Some repos do not have any test files: https://github.com/Ayms/node-Tor, https://github.com/pylls/padding-machines-for-tor, https://github.com/arpa2/draft-vanrein-tls-kdh, https://github.com/NLnetLabs/dnssec-ceremony-doc, https://github.com/sensifai/Sensifai-NPU-SDK, https://github.com/blueprint-freespeech/refresh-site, https://github.com/arpa2/draft-vanrein-httpauth-sasl, https://github.com/NLnetLabs/connectbyname

----- In some repos, when I searched for the word test, I realised it's mentioned in the text of an issue rather than in a test file, so it's fine for the script to return 0: https://github.com/NLnetLabs/dnssec-ceremony-tools, https://github.com/simmel-project/hardware, https://github.com/MEGA65/megaphone-r4-pcb, https://github.com/stef/zphinx-zerver, https://github.com/beeldengeluid/peertube-plugin-creative-commons, https://github.com/beeldengeluid/extending-peertube, https://github.com/jobisoft/TbSync, https://github.com/otrv4/otrv4, https://github.com/rust-threadpool/rust-threadpool, https://github.com/FOSDEM/video-hardware

----- Some repourls point to an issue rather than the owner + repo, like: https://github.com/osresearch/heads/issues/540
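The URL clean-up noted above (trailing "/", http instead of https, URLs with no repository name, URLs that point at an issue) could be sketched roughly as follows. `normalise_repo_url` is a hypothetical helper, not part of the existing scripts, and reducing an issue URL to its owner + repo is an assumption; you may prefer to flag such rows for manual review instead.

```python
from typing import Optional
from urllib.parse import urlparse


def normalise_repo_url(url: str) -> Optional[str]:
    """Return 'https://host/owner/repo', or None if no repository name is present."""
    parsed = urlparse(url.strip())
    # Drop empty segments, which also removes the effect of a trailing "/".
    parts = [segment for segment in parsed.path.split("/") if segment]
    if not parsed.netloc or len(parts) < 2:
        # e.g. a URL that names only an organisation: nothing to clone.
        return None
    owner, repo = parts[0], parts[1]
    # Assumption: anything after owner/repo (issues/540, tree/branch) is discarded,
    # and the scheme is always rewritten to https.
    return f"https://{parsed.netloc}/{owner}/{repo}"


if __name__ == "__main__":
    for candidate in (
        "https://github.com/inters/vita/",
        "https://github.com/osresearch/heads/issues/540",
    ):
        print(candidate, "->", normalise_repo_url(candidate))
```

Rows for which the helper returns None correspond to the "no repository name" case and can be filtered out before cloning.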

When I ran the script again, for the rows where the test count was not 0, I was not getting the last commit hash. Had to address this:
Ensure robust data handling in repository processing script
Decouple the conditions for skipping repositories to handle test file counting and commit hash fetching independently.
Modify the processing loop to always attempt fetching the last commit hash, even if test file counting was previously completed.
Include checks to clone repositories only if they do not exist, ensuring that interruptions in script execution do not prevent subsequent data capture.
Ensure consistent saving of DataFrame after every batch processing to prevent data loss.
The script finished running. Run it once more.
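A rough sketch of the decoupled loop described in that change, under several assumptions: rows are plain dicts standing in for DataFrame rows, and `clone`, `count_tests`, `last_hash` and `save` are hypothetical callables supplied by the caller (the real script's function names and signatures are not shown in this thread).

```python
import math
from pathlib import Path
from typing import Callable, List


def is_missing(value) -> bool:
    """True for None or NaN, the two ways an unfilled cell is assumed to appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def process_rows(
    rows: List[dict],
    clone_root: Path,
    clone: Callable[[str, Path], None],
    count_tests: Callable[[Path], int],
    last_hash: Callable[[Path], str],
    save: Callable[[List[dict]], None],
    batch_size: int = 10,
) -> None:
    for index, row in enumerate(rows, start=1):
        owner, repo = row["repourl"].rstrip("/").split("/")[-2:]
        local = clone_root / f"{owner}__{repo}"
        # Clone only if the directory is absent, so an interrupted run can resume.
        if not local.exists():
            clone(row["repourl"], local)
        # Test counting is skipped when a value is already recorded...
        if is_missing(row.get("testfilecount")):
            row["testfilecount"] = count_tests(local)
        # ...but the commit hash is always attempted, independently of the count.
        row["last_commit_hash"] = last_hash(local)
        # Persist after every batch so a crash loses at most one batch of work.
        if index % batch_size == 0:
            save(rows)
    save(rows)
```

Passing the git and persistence steps in as callables keeps the control flow testable without network access; in the real script they would wrap the actual clone, count, hash and DataFrame-saving code.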

Need to address :

https://github.com/tdf/odftoolkit.git & https://github.com/eduvpn/apple & https://github.com/stratosphereips/AIVPN --> test count = 0 (test is mentioned in issues but not in a file, so it's correct) -> the latest hash is nan. Not sure why.
This points to an issue, not a test file - https://github.com/osresearch/heads/issues/540
https://github.com/seedvault-app/seedvault has loads of test files; not sure why the testcount is 0 --> the test files end with .kt, for instance App/src/test/java/com/stevesoltys/seedvault/crypto/CryptoTest.kt --> cannot see this in the cloned repos on the hard disk either (after the second run, I can see the repo). Same with https://github.com/jitsi/jitsi-meet & https://github.com/newaetech/chipwhisperer (I can see test files).

Change ../utils/export_to_rdf.py to process the whole DataFrame rather than line by line.
Add the capability to save the result in Turtle RDF format to the script `github_repo_request_local.py`.

tnzmnjm commented 7 months ago

Investigation:

- https://github.com/tdf/odftoolkit.git --> This repo is about 600 MB, has 18 contributors and 18 branches, and I can see loads of test files on the web. I ran list_test_files only on this repo and it found 9667 items.
- https://github.com/stratosphereips/AIVPN ---> the script returned 0 after running on this repo again. This is correct: I checked the web and cannot see any file names or paths containing the word test, although there's one file which seems to test something: docs/build/_static/language_data.js
- Got the list of the repourls where testfilecountlocal = 0 and the last_commit_hash is not nan:
  - 50 https://github.com/osresearch/heads/issues/540 ---> this points to an issue, not an owner + repo
  - 132 https://github.com/eez-open/modular-psu ---> this is correct; checked on the web
  - 247 https://github.com/ernestwisniewski/kbin ---> repo is cloned (19 MB); ran the function and the result is 0, but can see test files on the web
  - 257 https://github.com/armijnhemel/binaryanalysis-... ---> this doesn't point to an owner + repo
  - 264 https://github.com/organicmaps/organicmaps ---> this hasn't been cloned
  - 270 https://github.com/chromi/sce ---> this hasn't been cloned
  - 271 https://github.com/Wakoma/nimble/tree/smart_doc ---> this doesn't point to an owner + repo
  - 288 https://github.com/overte-org/overte ---> this hasn't been cloned
- https://github.com/seedvault-app/seedvault ---> found 2 projects pointing at the same repo; their NLnet pages are different (https://nlnet.nl/project/Seedvault/, https://nlnet.nl/project/SeedVault-Integrity/). The repo has loads of test files on the web. The cloned repo is about 300 MB. Ran the test count function again ---> found 94 test files.
- https://github.com/jitsi/jitsi-meet ---> cloned repo is 416 MB. Ran the count function again ---> found 13 test files.
- https://github.com/newaetech/chipwhisperer ---> cloned repo is 1.1 GB. Ran the count function again ---> found 224 test files.
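For anyone reproducing these counts, here is a minimal sketch of how a local test-file count might work. It assumes the rule is "any file whose path contains the word test, ignoring case", which matches the description in this thread but is not confirmed against the real `list_test_files` implementation; skipping the `.git` directory is also an assumption.

```python
from pathlib import Path
from typing import List


def list_test_files(repo_dir) -> List[str]:
    """Return relative paths of files whose path contains 'test' (case-insensitive)."""
    root = Path(repo_dir)
    matches = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Ignore git metadata and anything that is not a regular file.
        if ".git" in relative.parts or not path.is_file():
            continue
        if "test" in relative.as_posix().lower():
            matches.append(relative.as_posix())
    return sorted(matches)


if __name__ == "__main__":
    import sys

    found = list_test_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{len(found)} test files")
```

A plain substring match also counts names such as "latest.txt", so large totals are worth spot-checking. A count of 0 for a repo that visibly has tests more likely means the clone was missing or incomplete when the count ran, as the re-runs above suggest.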

commercetest / nlnet

Generate RDF files with project info #5
