Closed julianharty closed 7 months ago
RDF file formats include https://jena.apache.org/documentation/io/
Some useful sources of information about processing RDF include:
I looked at the files in the NLnet repository (https://codeberg.org/NLnet/importscripts) and realised the RDF format used there is Turtle (.ttl).
I wrote a script to convert a row to .ttl format.
The input (first row) is:
The output is:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <https://nlnet.nl/project/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:2016-12-023 a foaf:Project ;
    rdfs:seeAlso <https://github.com/getdnsapi/stubby> ;
    foaf:homepage ns1:stubby ;
    ns1:testFileCount 2 .
We will be able to amend the script to match NLnet's requirements in the future and apply it to the whole dataset.
The script can be found under 'src/rdf_export/export_to_rdf.py'.
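For readers who want to see the shape of such a conversion, here is a minimal sketch using only the Python standard library. It is not the code in export_to_rdf.py (which may well use an RDF library instead). The column names project_code, project_url, repourl and testfilecount are assumptions, and the sample row is the vita row quoted elsewhere in this thread. IRIs are written in full inside angle brackets and are assumed to need no escaping.

```python
# Prefixes copied from the Turtle output shown in this thread.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ns1": "https://nlnet.nl/project/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def turtle_header() -> str:
    """Return the @prefix declarations, one per line, sorted by prefix."""
    return "\n".join(f"@prefix {p}: <{iri}> ." for p, iri in sorted(PREFIXES.items()))


def row_to_turtle(row: dict) -> str:
    """Serialise one row as a foaf:Project description.

    The keys used here are hypothetical column names, not confirmed ones.
    """
    subject = f"<{PREFIXES['ns1']}{row['project_code']}>"
    lines = [
        f"{subject} a foaf:Project ;",
        f"    rdfs:seeAlso <{row['repourl']}> ;",
        f"    foaf:homepage <{row['project_url']}> ;",
        f"    ns1:testFileCount {int(row['testfilecount'])} .",
    ]
    return "\n".join(lines)


def rows_to_turtle(rows) -> str:
    """Serialise many rows into one Turtle document with a single header."""
    body = "\n\n".join(row_to_turtle(r) for r in rows)
    return f"{turtle_header()}\n\n{body}\n"


example = {
    "project_code": "2017-10-006a",
    "project_url": "https://nlnet.nl/project/vita",
    "repourl": "https://github.com/inters/vita/",
    "testfilecount": 176,
}
print(rows_to_turtle([example]))
```

Passing the whole list of rows to rows_to_turtle writes the prefixes once, which is one way to handle the full dataset rather than a single row.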
repourls that start with http rather than https.
repourls which have "/" at the end of their repo have the testfilecount value but no commit hash value: 2017-10-006a,https://nlnet.nl/project/vita,https://github.com/inters/vita/,176,nan
testcount is 0:
----- repourl: https://github.com/siacs/Conversations
--> when I opened it, I see the GitHub URL is now: https://github.com/iNPUTmice/Conversations
--> in the README section I see they have moved the code: "A New Home. We've moved. Conversations is now developed on Codeberg."
moved to Codeberg: https://github.com/librestack/librecast
----- some repos do not have any test files:
https://github.com/Ayms/node-Tor
https://github.com/pylls/padding-machines-for-tor
https://github.com/arpa2/draft-vanrein-tls-kdh
https://github.com/NLnetLabs/dnssec-ceremony-doc
https://github.com/sensifai/Sensifai-NPU-SDK
https://github.com/blueprint-freespeech/refresh-site
https://github.com/arpa2/draft-vanrein-httpauth-sasl
https://github.com/NLnetLabs/connectbyname
------ in some repos, when I searched for the word "test" I realised it is only mentioned in an issue as text, not as a test file, so it's fine to get 0 in the script:
https://github.com/NLnetLabs/dnssec-ceremony-tools
https://github.com/simmel-project/hardware
https://github.com/MEGA65/megaphone-r4-pcb
https://github.com/stef/zphinx-zerver
https://github.com/beeldengeluid/peertube-plugin-creative-commons
https://github.com/beeldengeluid/extending-peertube
https://github.com/jobisoft/TbSync
https://github.com/otrv4/otrv4
https://github.com/rust-threadpool/rust-threadpool
https://github.com/FOSDEM/video-hardware
----- Some repourls point to an issue rather than the owner+repo, like: https://github.com/osresearch/heads/issues/540
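The URL quirks noted above (http instead of https, a trailing "/", a ".git" suffix, links to an issue or a branch) could be handled in one normalisation step before cloning. The sketch below is only an illustration: normalise_repourl is a hypothetical helper, not something in the existing script, and rewriting every URL to https assumes the host serves the repo over https.

```python
from urllib.parse import urlsplit


def normalise_repourl(url: str):
    """Return (canonical_url, extra_segments).

    canonical_url is 'https://<host>/<owner>/<repo>' or None when the
    path has fewer than two segments. extra_segments holds whatever
    followed owner/repo (e.g. ['issues', '540']), so the caller can
    decide whether to flag the row for a manual check.
    """
    parts = urlsplit(url.strip())
    # Dropping empty segments also takes care of a trailing "/".
    segments = [s for s in parts.path.split("/") if s]
    if not parts.netloc or len(segments) < 2:
        return None, segments
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    # Assumption: the scheme can always be upgraded to https.
    canonical = f"https://{parts.netloc.lower()}/{owner}/{repo}"
    return canonical, segments[2:]


for raw in [
    "https://github.com/inters/vita/",
    "https://github.com/tdf/odftoolkit.git",
    "https://github.com/osresearch/heads/issues/540",
    "https://github.com/Wakoma/nimble/tree/smart_doc",
]:
    print(raw, "->", normalise_repourl(raw))
```

Returning the leftover path segments instead of silently dropping them keeps rows such as the heads issue link visible for review.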
When I ran the script again, for the rows where the test count was not 0 I was not getting the last commit hash, so I had to address this:
Ensure robust data handling in repository processing script
Decouple the conditions for skipping repositories to handle test file
counting and commit hash fetching independently.
Modify the processing loop to always attempt fetching the last commit
hash, even if test file counting was previously completed.
Include checks to clone repositories only if they do not exist, ensuring that interruptions in script execution do not prevent subsequent data capture.
Ensure consistent saving of DataFrame after every batch processing to prevent data loss.
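The points above could look roughly like the following. This is a sketch under stated assumptions, not the actual change: it uses plain dicts instead of a pandas DataFrame so it runs with the standard library alone; process_rows, ensure_clone, is_missing and the count_tests and save callables are hypothetical names; and the hash is only fetched when it is missing. The git commands themselves ("git clone" and "git rev-parse HEAD") are standard.

```python
import math
import subprocess
from pathlib import Path


def is_missing(value) -> bool:
    """True for None or a float NaN (the way an empty cell usually shows up)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def ensure_clone(repourl: str, dest: Path) -> None:
    """Clone only when the target directory does not exist yet."""
    if not dest.exists():
        subprocess.run(["git", "clone", repourl, str(dest)], check=True)


def git_last_commit_hash(repo_dir: Path) -> str:
    """Ask git for the hash of the currently checked-out commit."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def process_rows(rows, clone_root, count_tests, fetch_hash=git_last_commit_hash,
                 clone=ensure_clone, save=lambda rows: None, batch_size=10):
    """Fill in missing test counts and commit hashes independently."""
    for i, row in enumerate(rows, start=1):
        need_count = is_missing(row.get("testfilecount"))
        need_hash = is_missing(row.get("last_commit_hash"))
        if need_count or need_hash:
            # owner__repo avoids clashes between repos that share a name.
            name = "__".join(row["repourl"].rstrip("/").split("/")[-2:])
            repo_dir = Path(clone_root) / name
            clone(row["repourl"], repo_dir)
            # Two separate checks: a finished count no longer hides a missing hash.
            if need_count:
                row["testfilecount"] = count_tests(repo_dir)
            if need_hash:
                row["last_commit_hash"] = fetch_hash(repo_dir)
        if i % batch_size == 0:
            save(rows)  # persist after every batch
    save(rows)  # and once more at the end
    return rows
```

Because the clone, count and hash steps are passed in as callables, the loop can be exercised with stand-ins before running it against several hundred repositories.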
The script finished running. Run it one more time.
Need to address:
https://github.com/tdf/odftoolkit.git & https://github.com/eduvpn/apple & https://github.com/stratosphereips/AIVPN --> test count = 0 ("test" is mentioned in issues but not as a file, so it's correct) -> the latest hash is nan. Not sure why.
https://github.com/osresearch/heads/issues/540
https://github.com/seedvault-app/seedvault has loads of test files, not sure why the testcount is 0 --> the test files end with .kt, for instance App/src/test/java/com/stevesoltys/seedvault/crypto/CryptoTest.kt --> cannot see this in the cloned repos on the hard disk either (after the second run, I can see the repo) ---- same with https://github.com/jitsi/jitsi-meet & https://github.com/newaetech/chipwhisperer (I can see test files)
../utils/export_to_rdf.py to process the whole dataframe rather than line by line
Investigation:
https://github.com/tdf/odftoolkit.git --> This repo is about 600 MB, has 18 contributors and 18 branches, and I can see loads of test files on the web. I ran list_test_files on this repo only and it found 9667 items.
https://github.com/stratosphereips/AIVPN ---> the script returned 0 after running on this repo again, which is correct: I checked on the web and cannot see any filenames/paths containing the word "test", but there's one file which seems to test something: docs/build/_static/language_data.js
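To make the counting rule concrete, here is a minimal sketch of a path-based check, assuming a file counts as a test file when the word "test" appears anywhere in its path relative to the repo root, ignoring case. The real list_test_files may use a different rule; list_test_files_sketch is a hypothetical stand-in. The demo builds a throwaway directory containing the two paths mentioned in this thread plus a file under .git that should be ignored.

```python
import os
import tempfile
from pathlib import Path


def list_test_files_sketch(repo_dir, keyword="test"):
    """Return relative paths whose path contains `keyword` (case-insensitive)."""
    repo_dir = Path(repo_dir)
    matches = []
    for root, dirs, files in os.walk(repo_dir):
        # Skip git metadata so hook samples and refs are never counted.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            rel = (Path(root) / name).relative_to(repo_dir).as_posix()
            if keyword in rel.lower():
                matches.append(rel)
    return sorted(matches)


# Demo on a temporary directory; no real repository is needed.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "App/src/test/java/com/stevesoltys/seedvault/crypto/CryptoTest.kt",
        "docs/build/_static/language_data.js",
        ".git/hooks/test.sample",
    ]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = list_test_files_sketch(tmp)

print(found)
```

A plain substring match is deliberately loose: it picks up .kt files without any extension list, but it would also count a path such as "latest/notes.md", so a stricter pattern may be needed if that turns out to inflate the numbers.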
Got the list of the repourls where testfilecountlocal = 0 and last_commit_hash is not nan:
50 https://github.com/osresearch/heads/issues/540 ---> This points to an issue, not an owner+repo
132 https://github.com/eez-open/modular-psu ---> This is correct. Checked on the web
247 https://github.com/ernestwisniewski/kbin ---> repo is cloned (19 MB); ran the function and the result is 0, but I can see test files on the web
257 https://github.com/armijnhemel/binaryanalysis-... ---> this doesn't point to an owner + repo
264 https://github.com/organicmaps/organicmaps ---> this hasn't been cloned
270 https://github.com/chromi/sce ---> this hasn't been cloned
271 https://github.com/Wakoma/nimble/tree/smart_doc ---> this doesn't point to an owner + repo
288 https://github.com/overte-org/overte ---> this hasn't been cloned
https://github.com/seedvault-app/seedvault ---> found 2 projects pointing at the same repo. Their NLnet pages are different (https://github.com/seedvault-app/seedvault, https://nlnet.nl/project/Seedvault/, https://nlnet.nl/project/SeedVault-Integrity/). It has loads of test files on the web. The cloned repo is about 300 MB. Running the test count function again ---> found 94 test files.
https://github.com/jitsi/jitsi-meet ---> cloned repo is 416 MB; ran the count function again ---> found 13 test files.
https://github.com/newaetech/chipwhisperer ---> cloned repo is 1.1 GB; running the count function again ---> found 224 test files.
Context
We'd like to be able to save the project-related data we obtain so it can be combined by us and by others with the parent data of the NLnet projects. NLnet uses RDF data formats for the parent data and can accept and combine related RDF data if we provide it.
Task
Once the RDF structures have been defined (which will probably be on https://codeberg.org/NLnet/importscripts), generate query results exported using these structures.