How to get the partial data for our dataset

JCbette commented 2 weeks ago

Thank you so much for your excellent work. I have a quick question about selecting partial data from another dataset. I noticed that some methods randomly sample the dataset while keeping the category distribution similar to that of the original dataset. How did you implement that process? Thanks in advance for your reply.

Haru-zt commented 1 week ago

I drew a random subset of the dataset, repeated the draw 100,000 times, and kept the subset whose category distribution had the smallest standard deviation from the full dataset's distribution.
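For readers who want to try this, here is a minimal sketch of the repeated-sampling idea described above. It is not the author's script. It assumes one category label per item (for detection data you would count instances per image instead), and it scores each candidate by the population standard deviation of the per-category proportion differences, which is one plausible reading of "smallest standard deviation". The function names and parameters are hypothetical.

```python
import random
from collections import Counter
from statistics import pstdev


def class_proportions(labels, classes):
    """Return the fraction of items belonging to each class, in `classes` order."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def select_representative_subset(labels, fraction, n_trials=100_000, seed=0):
    """Draw `n_trials` random subsets and keep the one whose class
    proportions deviate least from those of the full label list.

    Assumption: the score is the population standard deviation of the
    per-class proportion differences; the thread does not give the exact metric.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    target = class_proportions(labels, classes)
    k = max(1, round(fraction * len(labels)))

    best_indices, best_score = None, float("inf")
    for _ in range(n_trials):
        indices = rng.sample(range(len(labels)), k)
        props = class_proportions([labels[i] for i in indices], classes)
        score = pstdev([p - t for p, t in zip(props, target)])
        if score < best_score:
            best_indices, best_score = sorted(indices), score
    return best_indices, best_score


# Toy example: pick 10% of 100 items with a 60/30/10 class split.
labels = ["ship"] * 60 + ["plane"] * 30 + ["car"] * 10
indices, score = select_representative_subset(labels, fraction=0.1, n_trials=2000)
```

The returned `indices` point into the original label list, so they can be used to write out the selected file names or annotations. Fixing `seed` makes the selection reproducible.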

JCbette commented 1 week ago

Thank you for your reply. We are very interested in your work and would like to build on it by using your model on our own dataset. Would you be able to share the code you used to select the subset? Thank you so much!

Haru-zt commented 1 week ago

It depends on the size of your dataset and the proportion of labeled data used for semi-supervised learning. In my experience, if the dataset is not too small and that proportion exceeds 10%, the performance variance of direct random selection is very small. The settings in the paper were chosen for consistency with previous work. If conditions permit, random 5-fold cross-validation is a good choice. As for the script, I think you can write one yourself. I will look through my old code to see if I can find it, but it has been a while, so I might not be able to. Alternatively, you can ask GPT.
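The random 5-fold split mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the thread does not say how each fold is used, so treating one fold as the labeled subset and the remaining folds as unlabeled data is a guess, and `random_k_folds` is a hypothetical helper name.

```python
import random


def random_k_folds(n_items, k=5, seed=0):
    """Shuffle item indices and split them into `k` folds of near-equal size."""
    if not 1 <= k <= n_items:
        raise ValueError("k must be between 1 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    return [sorted(indices[i::k]) for i in range(k)]


# Example: 1,000 images split into 5 random folds.
folds = random_k_folds(1000, k=5, seed=42)
for fold_id, labeled in enumerate(folds):
    # Assumption: this fold is the labeled subset; all other items are unlabeled.
    labeled_set = set(labeled)
    unlabeled = [i for i in range(1000) if i not in labeled_set]
```

Running the experiment once per fold and averaging the results gives a picture of how much the outcome depends on which subset was selected.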

Haru-zt / DDPLS
