davidtorosyan / wikimit

a wikipedia-to-git converter
MIT License

Proof of concept #1

Open davidtorosyan opened 3 years ago

davidtorosyan commented 3 years ago

I ran the proof of concept for https://en.wikipedia.org/wiki/Finch to get a sense of a) Wikipedia's APIs and b) performance.

The generated repo (https://github.com/wikimit-hub/Finch) has 1090 commits. Here are the logs:

Querying wikipedia...
wiki took 0.47 seconds
Adding 100 commits...
git took 9.71 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 1.22 seconds
Adding 100 commits...
git took 9.60 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 9.94 seconds
Querying wikipedia...
wiki took 0.74 seconds
Adding 100 commits...
git took 9.82 seconds
Querying wikipedia...
wiki took 0.68 seconds
Adding 100 commits...
git took 10.13 seconds
Querying wikipedia...
wiki took 0.82 seconds
Adding 100 commits...
git took 9.85 seconds
Querying wikipedia...
wiki took 0.76 seconds
Adding 100 commits...
git took 10.08 seconds
Querying wikipedia...
wiki took 0.77 seconds
Adding 100 commits...
git took 9.91 seconds
Querying wikipedia...
wiki took 0.83 seconds
Adding 100 commits...
git took 9.93 seconds
Querying wikipedia...
wiki took 0.63 seconds
Adding 89 commits...
git took 8.80 seconds
Done!
wiki took a total of 8.45 seconds
git took a total of 107.37 seconds

Conclusions:

  1. Wikipedia's API supports paging through revisions from the beginning, using a timestamp as the offset.
  2. Git is our bottleneck, taking about 100 ms per commit (107.37 seconds for 1,089 commits). That's pretty slow; hopefully we can speed things up by looking into more advanced git commands.
  3. The resulting git blame is pretty good for looking back into history, but could be improved (see open question 4).
  4. This took a long time for a relatively tiny article: about 116 seconds in total for 1,089 revisions.
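
To make conclusion 1 concrete, here is a minimal sketch of paging through revisions oldest-first with the MediaWiki Action API. It is not the proof of concept's actual code. The parameters (`rvdir=newer`, `rvlimit`, and the `continue` object echoed back on the next request) are standard API parameters as I understand them; `API_URL`, `revision_params`, `iter_revisions`, `http_fetch`, and the User-Agent string are names made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

# Assumption: English Wikipedia's Action API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def revision_params(title: str, limit: int = 100,
                    cont: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build query parameters for one page of revisions, oldest first."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvlimit": str(limit),
        "rvdir": "newer",  # start at the first revision and move forward
        "rvprop": "ids|timestamp|user|comment|content",
        "rvslots": "main",
    }
    if cont:
        # The server's "continue" object (including rvcontinue, which
        # carries a timestamp-based offset) is sent back unchanged.
        params.update(cont)
    return params


def iter_revisions(title: str,
                   fetch: Callable[[Dict[str, str]], dict],
                   limit: int = 100) -> Iterator[dict]:
    """Yield every revision of `title`, requesting `limit` at a time."""
    cont = None
    while True:
        response = fetch(revision_params(title, limit, cont))
        for page in response.get("query", {}).get("pages", {}).values():
            yield from page.get("revisions", [])
        cont = response.get("continue")
        if not cont:
            break  # no "continue" object means this was the last page


def http_fetch(params: Dict[str, str]) -> dict:
    """Hypothetical network fetcher; defined here but not called."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikimit-sketch/0.1"})
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Passing the fetcher in as an argument keeps the paging loop testable without touching the network; `iter_revisions("Finch", http_fetch)` would be the live call.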

Open questions:

  1. I'm assuming English Wikipedia (en.wikipedia.org) right now. Including other languages should be pretty easy, but that needs testing.
  2. How are renames handled when looking through revision history?
  3. How should renames be handled for the repos?
  4. Should the wiki lines be split up further? Right now we're using newlines from the source, but there's nothing stopping us from adding additional newlines to help git blame along. Ideally this would be along sentence boundaries, but that might be hard to reliably track.
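
As a rough illustration of question 4, here is a naive sketch of splitting each source line at sentence boundaries before committing. The rule used (sentence-ending punctuation, then whitespace, then an uppercase letter) is an assumption chosen for simplicity, and `split_sentences` and `reflow` are hypothetical names, not part of the proof of concept.

```python
import re
from typing import List

# Naive boundary: ".", "!" or "?" followed by whitespace and an
# uppercase ASCII letter. Abbreviations, references, and wiki markup
# will fool this rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(line: str) -> List[str]:
    """Split one line of wikitext into sentence-like pieces."""
    return [piece for piece in _SENTENCE_END.split(line) if piece]


def reflow(wikitext: str) -> str:
    """Put each sentence on its own line, keeping existing blank lines."""
    out: List[str] = []
    for line in wikitext.split("\n"):
        out.extend(split_sentences(line) or [""])
    return "\n".join(out)
```

This shows why reliable tracking is hard: `split_sentences("See e.g. Darwin.")` breaks after "e.g.", so a real splitter would need abbreviation handling and some awareness of wiki markup.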

Also, the resulting repo was 6.53 MB and contained 4,139 files.

maximilian-reinsel commented 3 years ago

Tested on Linux - results are much better: 9 seconds for Wikipedia and 4 seconds spent on git. This is for a 7 MB output with 1500 files.

Our guess is that Git for Windows is slower.
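
One way to probe this guess, and the "more advanced git commands" idea from the conclusions, would be to feed all revisions to a single `git fast-import` process instead of running git per commit. Whether the proof of concept spawns git per commit is not confirmed in this thread, and this approach is untested here. The sketch below only builds the import stream; `Revision`, `fast_import_stream`, the `Finch.wiki` path, and the placeholder email are assumptions for illustration.

```python
from typing import Iterable, NamedTuple


class Revision(NamedTuple):
    user: str
    timestamp: int  # seconds since the Unix epoch, UTC
    comment: str
    content: str


def fast_import_stream(revisions: Iterable[Revision],
                       path: str = "Finch.wiki",
                       ref: str = "refs/heads/main") -> bytes:
    """Build a `git fast-import` stream with one commit per revision.

    Intended for a fresh repository: later commits in the same stream
    use the previous commit on `ref` as their parent.
    """
    out = bytearray()
    for rev in revisions:
        # Strip characters that would break the identity line.
        name = rev.user.replace("<", "").replace(">", "").replace("\n", " ")
        message = rev.comment.encode("utf-8")
        content = rev.content.encode("utf-8")
        out += f"commit {ref}\n".encode("utf-8")
        # Placeholder email; Wikipedia revisions carry no email address.
        out += (f"committer {name} <wikimit@example.invalid> "
                f"{rev.timestamp} +0000\n").encode("utf-8")
        # `data` lengths are byte counts, not character counts.
        out += b"data %d\n" % len(message) + message + b"\n"
        out += b"M 100644 inline " + path.encode("utf-8") + b"\n"
        out += b"data %d\n" % len(content) + content + b"\n"
    return bytes(out)
```

The resulting bytes would be piped to `git fast-import` inside a repository created with `git init`. Timing that on both Windows and Linux would show how much of the gap comes from per-commit overhead.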

maximilian-reinsel commented 3 years ago

Re-ran the same Finch test:

Querying wikipedia...
wiki took 0.87 seconds
Adding 100 commits...
git took 0.68 seconds
Querying wikipedia...
wiki took 2.81 seconds
Adding 100 commits...
git took 1.68 seconds
Querying wikipedia...
wiki took 2.18 seconds
Adding 100 commits...
git took 2.73 seconds
Querying wikipedia...
wiki took 1.15 seconds
Adding 100 commits...
git took 1.60 seconds
Querying wikipedia...
wiki took 1.39 seconds
Adding 100 commits...
git took 1.04 seconds
Querying wikipedia...
wiki took 2.84 seconds
Adding 100 commits...
git took 2.80 seconds
Querying wikipedia...
wiki took 2.92 seconds
Adding 100 commits...
git took 0.74 seconds
Querying wikipedia...
wiki took 2.70 seconds
Adding 100 commits...
git took 0.75 seconds
Querying wikipedia...
wiki took 2.87 seconds
Adding 100 commits...
git took 0.93 seconds
Querying wikipedia...
wiki took 2.79 seconds
Adding 100 commits...
git took 0.97 seconds
Querying wikipedia...
wiki took 2.31 seconds
Adding 89 commits...
git took 1.71 seconds
Done!
wiki took a total of 24.84 seconds
git took a total of 15.65 seconds
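
For comparison, here is the per-commit arithmetic for the two logged runs. It uses only the totals printed above and the commit count implied by the batches (ten batches of 100 plus one of 89); `ms_per_commit` is just a helper name for this sketch.

```python
# Commit count taken from the logs: 10 batches of 100, then 89.
COMMITS = 10 * 100 + 89  # 1089


def ms_per_commit(total_seconds: float, commits: int = COMMITS) -> float:
    """Average git time per commit, in milliseconds."""
    return total_seconds / commits * 1000


first_run = ms_per_commit(107.37)  # git total from the first run
rerun = ms_per_commit(15.65)       # git total from the re-run

print(f"first run: {first_run:.1f} ms/commit")  # first run: 98.6 ms/commit
print(f"re-run:    {rerun:.1f} ms/commit")      # re-run:    14.4 ms/commit
print(f"speedup:   {first_run / rerun:.1f}x")   # speedup:   6.9x
```

In the re-run, git (15.65 seconds) is no longer the bottleneck; querying Wikipedia (24.84 seconds) takes longer.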