commercetest / nlnet

Analysis of the opensource codebases of NLnet sponsored projects.
MIT License

TSV and/or CSV file formats #10

Closed julianharty closed 4 months ago

julianharty commented 5 months ago

Context

NLnet foundation provides the data using tabs to separate the values in each row (commonly known as TSV). Separating values with commas (commonly known as CSV) is much more popular. It's easy to mistakenly feed one format to a program that expects the other.

Let's enhance this project so it can process both formats.

Thoughts

There are various ways this can be done. One is to use the extension part of the filename to indicate the format, as Microsoft Windows does: .tsv is considered to be tab separated and .csv is considered to be comma separated. Another is to read the contents of the file, check each row for commas and tabs, and infer the separator from the results.
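A minimal sketch of the two approaches combined, using only Python's standard `csv` module. The names `EXTENSION_DELIMITERS`, `infer_delimiter_from_text`, `detect_delimiter`, and `read_rows` are hypothetical and not taken from this project. It assumes UTF-8 files, trusts a recognised extension first, and only inspects the contents when the extension is unknown.

```python
import csv
from pathlib import Path

# Hypothetical mapping from filename extension to delimiter.
EXTENSION_DELIMITERS = {".tsv": "\t", ".csv": ","}


def infer_delimiter_from_text(sample: str) -> str:
    """Guess the delimiter by checking each non-empty row for tabs and commas."""
    rows = [line for line in sample.splitlines() if line.strip()]
    if not rows:
        raise ValueError("cannot infer a delimiter from an empty sample")
    tab_rows = sum(1 for row in rows if "\t" in row)
    comma_rows = sum(1 for row in rows if "," in row)
    if tab_rows == 0 and comma_rows == 0:
        raise ValueError("no tabs or commas found in the sample")
    # Pick the separator found in more rows; ties go to tabs (an assumption,
    # chosen because the NLnet source data is tab separated).
    return "\t" if tab_rows >= comma_rows else ","


def detect_delimiter(path, sample_size: int = 64 * 1024) -> str:
    """Use the file extension if recognised, otherwise inspect the contents."""
    path = Path(path)
    delimiter = EXTENSION_DELIMITERS.get(path.suffix.lower())
    if delimiter is not None:
        return delimiter
    with path.open(newline="", encoding="utf-8") as handle:
        return infer_delimiter_from_text(handle.read(sample_size))


def read_rows(path):
    """Read every row of the file using the detected delimiter."""
    delimiter = detect_delimiter(path)
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))
```

Trusting the extension first is a design choice, not a requirement. If mislabelled files are the main concern, the order could be reversed so that the contents are always checked.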

julianharty commented 5 months ago

Some interim thoughts on addressing this

There are essentially two distinct repositories of information:

  1. The source data from NLnet in TSV format; this lists repositories hosted with a range of code hosting providers including github.com, gitlab.com, and others.
  2. A progress file for the queries being performed against the projects hosted on github.com. This has been implemented as a CSV file and (currently) has an extra column.

A reasonable next step is to revise the code that processes the repos hosted on github.com so that it uses the progress file if it exists; otherwise it checks for the TSV file and uses that to populate the progress file. Note: it may be worth exploring what facilities are available to persist the progress file as a DataFrame on disk.
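The fallback described above could be sketched roughly as follows, again with the standard `csv` module. The function name `load_or_create_progress` and both path parameters are hypothetical. The column layout is not given in this thread, so the sketch copies the TSV rows unchanged and leaves out both the filtering for github.com and the extra progress column.

```python
import csv
from pathlib import Path


def load_or_create_progress(progress_path, source_tsv_path):
    """Return rows from the progress file, creating it from the TSV if needed."""
    progress_path = Path(progress_path)
    if progress_path.exists():
        # The progress file is comma separated.
        with progress_path.open(newline="", encoding="utf-8") as handle:
            return list(csv.reader(handle))

    # No progress file yet: seed it from the tab-separated source data.
    # A missing source file raises FileNotFoundError.
    with Path(source_tsv_path).open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))

    with progress_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

If the rows are held in a pandas DataFrame (an assumption about what "DataFrame" refers to here), `DataFrame.to_csv` and `DataFrame.to_pickle` are two possible ways to persist it to disk.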

julianharty commented 5 months ago

As part of the improvements I also plan to rename the current file so it's clear it only processes repos hosted on github.com.

tnzmnjm commented 4 months ago