Powerkrieger / NobbyGPT

Reimplementation of nanoGPT for educational purposes

3a. Test Weeve data-extraction #3

Closed: Kuckuck44 closed this 9 months ago

Kuckuck44 commented 10 months ago

Can we use the Weeve plugin to generate our fine-tuning data?

Powerkrieger commented 10 months ago
image
Powerkrieger commented 10 months ago

So the free version only translates a very limited number of words. Of the three words that were changed, "At least" became "Mindestens" (correct), "common" became "gemeinsamen" (incorrect; the intended sense was "widely used"), and, in the screenshot above, "VS" became "Gegen" when it is supposed to mean Visual Studio. It already looks like this extension is not as powerful as it seems. Maybe a cool next project would be to train an AI to actually choose good words to swap. Weeve.ie does not seem to be good at it anyway.

Kuckuck44 commented 10 months ago

image

At least it's funny :D But maybe we actually need a new way of generating such data. Even so, I will generate a few samples.

Powerkrieger commented 10 months ago
image

It's not too bad on second thought. I am working on extracting the data right now. It's not pure HTML anymore; the swapped words are highlighted inline, wrapped in a lot of styling markup, so this requires some extraction. The question is which websites to use for this. Also, what is the structure of our data going to be?
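A rough sketch of what that extraction could look like, using only the Python standard library. Everything named here is an assumption for illustration: the `swapped-word` class is a hypothetical marker (the real markup the plugin injects would have to be checked in the browser's inspector), and the `{"text": ..., "swapped": [...]}` record is just one possible answer to the data-structure question, not a decision.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class SwappedWordExtractor(HTMLParser):
    """Collect the page text and the words wrapped in a marker element."""

    def __init__(self, marker_class="swapped-word"):
        # "swapped-word" is a hypothetical class name, not the plugin's real one.
        super().__init__(convert_charrefs=True)
        self.marker_class = marker_class
        self.text_parts = []   # all visible text, in document order
        self.swapped = []      # text found inside marker elements
        self._depth = 0        # > 0 while inside a marker element
        self._current = []     # text pieces of the marker being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            # Nested styling element inside the marker, e.g. <b> or <span>.
            self._depth += 1
        elif self.marker_class in classes:
            self._depth = 1
            self._current = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            self.swapped.append("".join(self._current).strip())

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._depth > 0:
            self._current.append(data)


def extract(html):
    """Return one sample record: cleaned text plus the list of swapped words."""
    parser = SwappedWordExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())
    return {"text": text, "swapped": parser.swapped}


sample = (
    '<p><span class="swapped-word" style="color: red">Mindestens</span> three '
    '<span class="swapped-word"><b>gemeinsamen</b></span> words</p>'
)
record = extract(sample)
```

Each record could then be written as one line of JSON per paragraph or page, which would keep the choice of source websites independent of the file format.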

Powerkrieger commented 10 months ago

New plan as of right now:

Kuckuck44 commented 10 months ago

Maybe we should have a short Discord meeting, because I am working on the same task right now.