What are the different 'repo_language' contained in the dataset?

CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57

Apache License 2.0


What are the different 'repo_language' contained in the dataset? #64

Closed JoaoLages closed 3 years ago

JoaoLages commented 3 years ago

I have only found Java. Wonder if someone can spare me the details without having to process the whole dataset :) Thank you for open sourcing it! Awesome stuff!

reshinthadithyan commented 3 years ago

The best way to process files of a specific language in the dataset is to filter on the file extension of the language you're looking for. Every datapoint has a "file_name" key, so you can filter on its extension. Example: data["file_name"].split(".")[-1] == "hs" for Haskell. I hope this helps. Let me know if you have any more questions. Good day.
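A minimal sketch of the extension-based filtering described above, which also tallies extensions as a rough answer to which languages are present. It assumes each datapoint behaves like a dict with a "file_name" key, as in the comment. The sample records, the "text" field, and the helper names (file_extension, filter_by_extension, count_extensions) are hypothetical and only for illustration. It swaps split(".")[-1] for os.path.splitext so that names without a dot, such as Makefile, yield an empty extension instead of the whole name.

```python
import os
from collections import Counter


def file_extension(datapoint):
    """Return the lowercased extension of the datapoint's file name, without the dot."""
    _, ext = os.path.splitext(datapoint["file_name"])
    return ext[1:].lower()


def filter_by_extension(datapoints, ext):
    """Keep only the datapoints whose file name ends in the given extension."""
    return [d for d in datapoints if file_extension(d) == ext]


def count_extensions(datapoints):
    """Count how many datapoints there are per file extension."""
    return Counter(file_extension(d) for d in datapoints)


# Hypothetical sample records standing in for real datapoints;
# only the "file_name" key is taken from the discussion above.
samples = [
    {"file_name": "src/Main.hs", "text": "main = putStrLn \"hi\""},
    {"file_name": "app/Query.java", "text": "class Query {}"},
    {"file_name": "db/schema.sql", "text": "CREATE TABLE t (id INT);"},
    {"file_name": "Makefile", "text": "all:"},
]

haskell_files = filter_by_extension(samples, "hs")
extension_counts = count_extensions(samples)
```

Counting extensions still means one pass over the data, but it only reads the file name of each datapoint, so it should be much cheaper than processing the file contents.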

JoaoLages commented 3 years ago

Do we have any trained model for SQL?

ncoop57 commented 3 years ago

The model's training data might include some SQL, but not enough to be representative. If you'd like a model trained specifically on SQL code, I recommend you check out this project: https://github.com/ElementAI/picard#overview

Closing this for now; feel free to reopen if you want to discuss more. However, a better place for an in-depth discussion of this might be our Discord channel!