current12 / Stat-222-Project

Create Fixed Train Test Split #54

Closed ijyliu closed 4 months ago

ijyliu commented 5 months ago

We need to create a train-test split in the data so we can make valid comparisons between different classifiers.  

ijyliu commented 5 months ago

@current12 I am starting on this now because I need it to construct some features on the training set only.

ijyliu commented 5 months ago

done

@current12 please implement the fixed train-test split 'train_test_80_20' in the regression code

current12 commented 5 months ago

I tried 'train_test_80_20', but it has an issue: the test set contains no credit risk "D" data. I used train_test_split() with random_state set to 2 instead, and with that split all categories are included.
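A quick way to confirm a gap like the one described above is to list the classes that have no test rows. This is only a minimal sketch: the function name `missing_test_classes` and the split values "train" and "test" are assumptions for illustration, not names taken from the project.

```python
def missing_test_classes(labels, split):
    """Return the classes that occur in the data but have no test rows.

    labels: class label per row (e.g. credit rating).
    split:  split assignment per row; "train"/"test" values are assumed.
    """
    classes_in_test = {lab for lab, s in zip(labels, split) if s == "test"}
    return sorted(set(labels) - classes_in_test)


# Toy data: class "D" only ever lands in the training set.
labels = ["A", "A", "B", "B", "D", "A"]
split = ["train", "test", "train", "test", "train", "train"]
print(missing_test_classes(labels, split))
```

An empty list means every class is represented in the test set.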

ijyliu commented 5 months ago

I don't think that's a problem

If we had to, we could change the variable so that it is created in the all data NLP file (the last file) as a stratified 80/20 split. But I don't think we need to.
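For reference, a stratified 80/20 split of the kind mentioned here could look roughly like the sketch below. It uses scikit-learn's `train_test_split` with its `stratify` argument. The toy data frame, the `Rating` label column, the `random_state` value, and the "train"/"test" values stored in the column are all assumptions for illustration; this is not the code from the project notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real table; "Rating" is a hypothetical label column.
df = pd.DataFrame({
    "Rating": ["A"] * 10 + ["B"] * 10 + ["D"] * 5,
    "x": range(25),
})

# Split the row labels 80/20, keeping class proportions equal in both parts.
train_idx, test_idx = train_test_split(
    df.index.to_numpy(),
    test_size=0.2,
    stratify=df["Rating"],
    random_state=0,  # a fixed seed makes the split reproducible
)

# Store the assignment as a column so every model reuses the same split.
df["train_test_80_20"] = "train"
df.loc[test_idx, "train_test_80_20"] = "test"

print(df.groupby(["Rating", "train_test_80_20"]).size())
```

Stratifying requires at least two rows per class, so very rare classes would need to be handled separately.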

ijyliu commented 5 months ago

Redid the sampling; all classes now have both train and test data.

Please adjust all code to use 'train_test_80_20', which is now updated.
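Downstream code can then read the stored split instead of calling a random splitter again. The following is a hedged sketch: the helper name `split_on_column` and the "train"/"test" values are assumptions, and the explicit column check is only there to give a clear error if a stale copy of the data is loaded.

```python
import pandas as pd


def split_on_column(df, split_col="train_test_80_20"):
    """Return (train, test) frames based on a precomputed split column."""
    if split_col not in df.columns:
        # Likely an outdated data file; pulling the latest data may fix it.
        raise KeyError(f"column {split_col!r} not found in the data")
    train = df[df[split_col] == "train"]  # "train"/"test" values are assumed
    test = df[df[split_col] == "test"]
    return train, test


# Toy usage with a hand-made split column.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "train_test_80_20": ["train", "train", "test", "train", "train"],
})
train, test = split_on_column(toy)
print(len(train), len(test))
```

Because every classifier reads the same column, their test metrics are computed on identical rows.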

current12 commented 5 months ago

There is no train_test_80_20. Do you know why?

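[image: screenshot, https://github.com/current12/Stat-222-Project/assets/73266307/37be2fe6-7e5b-466a-a6fe-8613c18e17af]
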
ijyliu commented 5 months ago

fixed, run git pull

current12 commented 5 months ago

I didn't see the update in the data, could you have a look?

ijyliu commented 5 months ago

For me it seems to be in the files in

Data/All_Data/All_Data_with_NLP_Features

ijyliu commented 5 months ago

it was fixed in this commit

https://github.com/current12/Stat-222-Project/commit/33791cf1114d0953521ae9788f3917f455a5e780#diff-5ce97f711b66ada0cd1575d388c5bb723fa90c5a6ae8efc943a0be8252227e23

if you look at the df at the bottom here, it has it

https://github.com/current12/Stat-222-Project/blob/33791cf1114d0953521ae9788f3917f455a5e780/Code/Data%20Loading%20and%20Cleaning/All%20Data/Create%20Combined%20All%20Data%20with%20NLP%20Features.ipynb

current12 commented 5 months ago

works now! thx!

ijyliu commented 4 months ago

regression code now uses new split