bin123apple / Fortran2Cpp

Fortran2Cpp: A new model designed for code translation between Fortran and C++
Apache License 2.0

token count 600 limit: data filtering: justify the choice #25

Open chunhualiao opened 3 weeks ago

chunhualiao commented 3 weeks ago

Show the token count distribution of the source dataset.

bin123apple commented 2 weeks ago

I tried token count limits of 300, 600, 900, and 1200. When the token count limit is 900 or 1200, it is usually hard for GPT-4 to fix all the bugs during the execution feedback process. But anyway, this is a parameter that we can adjust.

As for the token count distribution of the source dataset: do you want me to show the distribution for the source dataset, or for the dataset after deleting comments? The 600-token limit is there to control the length of the code without comments (dataset_generation/engine_F2C.py, lines 428-437):

```python
# Step 1: delete the comments of the source Fortran code
fortran_wo_com = delete_comments.format(Fortran_Code=fortran_code)
fortran_wo_com = generate_str_answer_gpt(fortran_wo_com, max_tokens)
fortran_wo_com = fortran_wo_com.encode().decode('unicode_escape', 'replace')
print(f"fortran_wo_com:\n{fortran_wo_com}")
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode(fortran_wo_com, disallowed_special=()))
print("fortran_wo_com length", token_count)
# control the length
if token_count < 600:
```

This step is performed by GPT-4, and we did not save all of the comment-stripped code.
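For reference, here is the same length filter as a self-contained sketch. The helper name and the demo string are illustrative; only the tiktoken calls mirror the snippet above:

```python
import tiktoken

TOKEN_LIMIT = 600  # the threshold discussed in this issue

def within_token_limit(code: str, limit: int = TOKEN_LIMIT) -> bool:
    """Return True if `code` tokenizes to fewer than `limit` GPT-4 tokens."""
    encoding = tiktoken.encoding_for_model("gpt-4")
    # disallowed_special=() treats special-token text as ordinary text
    # instead of raising an error, matching the call in engine_F2C.py.
    return len(encoding.encode(code, disallowed_special=())) < limit

# Example with an already comment-free snippet (in the real pipeline,
# the comment removal is done by GPT-4 first).
fortran_wo_com = "program demo\n  print *, 'hello'\nend program demo"
print(within_token_limit(fortran_wo_com))  # True for this short snippet
```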

chunhualiao commented 1 week ago

Show the distribution after comment removal. If it is too expensive to redo with GPT-4, write Python code to strip off the comments.

Science is about data generation and analysis. Please do save everything, including intermediate results, prompts, and responses, so that others can verify, reproduce, or improve the work.
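As a starting point for a GPT-4-free alternative, here is a minimal sketch of a comment stripper for free-form Fortran: it truncates each line at the first `!` that falls outside a character literal and drops lines that become empty. Fixed-form column-1 comments (`C`, `c`, `*`) would need extra handling, and the function name is illustrative:

```python
def strip_fortran_comments(source: str) -> str:
    """Remove free-form Fortran comments (everything after an
    unquoted `!`) and drop lines that become empty."""
    stripped = []
    for line in source.splitlines():
        out, quote = [], None  # `quote` tracks an open ' or " literal
        for ch in line:
            if quote:
                out.append(ch)
                if ch == quote:  # literal closed (doubled quotes toggle twice)
                    quote = None
            elif ch in ("'", '"'):
                quote = ch
                out.append(ch)
            elif ch == "!":
                break  # rest of the line is a comment
            else:
                out.append(ch)
        cleaned = "".join(out).rstrip()
        if cleaned:
            stripped.append(cleaned)
    return "\n".join(stripped)

demo = "program p\n  x = 1  ! set x\n  print *, 'hi! there'\nend program p"
print(strip_fortran_comments(demo))  # the `!` inside the string survives
```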

bin123apple commented 1 week ago

I just uploaded the statistics Python script; it is under dataset_generation/utils/statistic.py. By running `python dataset_generation/utils/statistic.py`, you can get the number of data entries with fewer than `{token_threshold}` tokens and a token distribution figure.

I set `token_threshold` to 600 and got the following output:

```
Working on the 0th code...
Working on the 10000th code...
Working on the 20000th code...
Working on the 30000th code...
Working on the 40000th code...
Working on the 50000th code...
Working on the 60000th code...
Working on the 70000th code...
Working on the 80000th code...
Number of data entries with tokens less than 600: 40482
```
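For readers without the repository at hand, a script producing output like the above could look roughly as follows. The dataset file name, its structure (a JSON list of code strings), and the plotting details are assumptions, not the actual contents of statistic.py:

```python
import json

import matplotlib.pyplot as plt
import tiktoken

token_threshold = 600
encoding = tiktoken.encoding_for_model("gpt-4")

with open("dataset.json") as f:
    codes = json.load(f)  # assumed: a list of Fortran source strings

token_counts = []
for i, code in enumerate(codes):
    if i % 10000 == 0:
        print(f"Working on the {i}th code...")
    token_counts.append(len(encoding.encode(code, disallowed_special=())))

below = sum(1 for n in token_counts if n < token_threshold)
print(f"Number of data entries with tokens less than {token_threshold}: {below}")

# Token distribution figure.
plt.hist(token_counts, bins=100)
plt.xlabel("Token count")
plt.ylabel("Number of data entries")
plt.title("Token count distribution")
plt.savefig("token_distribution.png")
```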