OpenTermsArchive / contrib-versions

Documents versions that are not maintained by a dedicated actor. Maintained collaboratively by volunteer contributors.
https://docs.opentermsarchive.org/navigate-history/

Check and clean up datasets #9

Closed MattiSG closed 2 years ago

MattiSG commented 2 years ago

Since 25 January, the published dataset has been incomplete. It contains 90 services instead of 281, and a total of 899 versions instead of 16,504.

This is probably the cause of https://github.com/ambanum/OpenTermsArchive.org/issues/102.

Hypothesis:

This would be consistent with the move to production on MongoDB, but I thought we had since integrated all of the historical data into Mongo. If so, maybe the versions were not regenerated correctly? This is all the more surprising since the versions published on GitHub seem to be complete (there are well over 280 files in https://github.com/ambanum/OpenTermsArchive-versions).

We have not yet checked whether the France and Dating datasets are complete.

Once this issue is fixed, all incomplete datasets should be erased.
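
The completeness check described above (90 services instead of 281, 899 versions instead of 16,504) could be automated along these lines. This is only a sketch: the directory layout (one folder per service, one file per version somewhere below it) and the names `summarize_dataset` and `is_complete` are assumptions made for illustration, not the actual dataset format or tooling.

```python
from pathlib import Path


def summarize_dataset(root: Path) -> tuple[int, int]:
    """Return (number of services, total number of versions) found under root.

    Assumed, hypothetical layout: one directory per service directly under
    root, and one file per version anywhere below that directory.
    """
    services = [entry for entry in root.iterdir() if entry.is_dir()]
    versions = sum(
        1
        for service in services
        for path in service.rglob("*")
        if path.is_file()
    )
    return len(services), versions


def is_complete(root: Path, expected_services: int, expected_versions: int) -> bool:
    """Check that the dataset has at least the expected number of services and versions."""
    services, versions = summarize_dataset(root)
    return services >= expected_services and versions >= expected_versions
```

With the figures from this issue, `is_complete(Path("dataset"), 281, 16504)` would be expected to return `False` for the incomplete dataset, where `dataset` is a placeholder for wherever the release archive was extracted.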

Ndpnt commented 2 years ago

I think it will be fixed by https://github.com/ambanum/OpenTermsArchive/pull/748

MattiSG commented 2 years ago

According to @Ndpnt, this is because we do a shallow clone when we initially set up an instance.
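
To illustrate the shallow clone explanation: a shallow clone only holds the most recent commits, so anything generated from its history would presumably see only part of the versions. A minimal sketch for detecting this and fetching the full history follows. It assumes `git` is on the `PATH`; the helper names are hypothetical and this is not the code from the pull requests linked in this thread.

```python
import subprocess


def git(repo, *args) -> str:
    """Run a git command inside repo and return its trimmed standard output."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def is_shallow(repo) -> bool:
    """True if the repository was cloned with a truncated history (e.g. --depth)."""
    return git(repo, "rev-parse", "--is-shallow-repository") == "true"


def commit_count(repo) -> int:
    """Number of commits reachable from HEAD in the local clone."""
    return int(git(repo, "rev-list", "--count", "HEAD"))


def ensure_full_history(repo) -> None:
    """Fetch the missing history if the clone is shallow; do nothing otherwise."""
    if is_shallow(repo):
        git(repo, "fetch", "--unshallow")
```

Comparing `commit_count` before and after `ensure_full_history` on an instance's versions repository gives a quick indication of how much history the shallow clone was missing.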

martinratinaud commented 2 years ago

Fixed by https://github.com/ambanum/OpenTermsArchive/pull/804

martinratinaud commented 2 years ago

We need to wait until Monday to see what the dataset looks like, but since the versions repo contains all commits, it should be OK.

martinratinaud commented 2 years ago

Reopening until we have checked the dataset.

Ndpnt commented 2 years ago

I ran the command manually to generate a dataset, so you can check it now: https://github.com/ambanum/OpenTermsArchive-versions/releases

martinratinaud commented 2 years ago

I checked: the dataset contains 285 service providers, with records going back to 2012 for 1.2.3 greetings and as recent as 3 days ago for Instagram.

So it is OK

MattiSG commented 2 years ago

I confirm that incomplete datasets between 21/01 and 25/03 have been erased, thanks @martinratinaud! 👍