drop NaN before sending to ML?

Ben-Sherman / quarterly-earnings-machine-learning-algo

A Commission-Free Algo Trading Bot By Machine Learning Company SEC Filing Language

drop NaN before sending to ML? #3

Open FlorinAndrei opened 4 years ago

FlorinAndrei commented 4 years ago

https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/4ce8e1e829ed6f6ecd74f8abbfaf91114af1201b/cloudml_prepare_local_csv.py#L31

Should there be a df.dropna(subset=['prc_change_t2']) here?

I collected data for the last 10 years and get about 91k rows at that point. But after running .dropna(subset=['prc_change_t2']), only about 20k rows remain. I think the rows with NaN in that column should not be sent to ML at all.
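To make the check concrete, here is a minimal sketch of counting and dropping rows with a missing `prc_change_t2`. It assumes a pandas DataFrame; the toy values and the `text` column are made up for illustration and are not taken from the repository.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared table; values are illustrative only.
df = pd.DataFrame({
    "text": ["filing a", "filing b", "filing c", "filing d"],
    "prc_change_t2": [0.03, np.nan, -0.01, np.nan],
})

# Count missing values in the column before deciding whether to drop.
n_missing = int(df["prc_change_t2"].isna().sum())

# Keep only rows where prc_change_t2 is present.
labeled = df.dropna(subset=["prc_change_t2"])

print(f"{len(df)} rows, {n_missing} missing, {len(labeled)} kept")
```

Printing the counts before and after makes it easy to see how much of the table the drop removes.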

FlorinAndrei commented 4 years ago

The same goes for the text column. I ran dropna() on that one too, and the final table now has about 18k rows.

FlorinAndrei commented 4 years ago

Finally, should the filename column even be kept in the CSV? It's not used by ML at all, is it?

The CSV is big enough as it is, so we might as well trim the columns that aren't used.
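A minimal sketch of trimming the column before writing the CSV, assuming a pandas DataFrame with a `filename` column. The toy data and the in-memory buffer are illustrative only; this is not necessarily how the repository does it.

```python
import io

import pandas as pd

# Toy table; only the column names matter for this illustration.
df = pd.DataFrame({
    "filename": ["a.txt", "b.txt"],
    "text": ["filing a", "filing b"],
    "prc_change_t2": [0.03, -0.01],
})

# errors="ignore" keeps this safe if the column has already been removed.
trimmed = df.drop(columns=["filename"], errors="ignore")

# Write to an in-memory buffer here; a real run would pass a file path.
buf = io.StringIO()
trimmed.to_csv(buf, index=False)
csv_text = buf.getvalue()
```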

Ben-Sherman commented 4 years ago

Yes, you're right about dropping the filename column. I added that here: https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/commit/c5b145852bbf8f17f3e472eb5fb319e254a554a3

As for another dropna, there should be nothing left to drop at that point if you're joining with the financial dataframe, since the nulls in that column have already been dropped there: https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/c5b145852bbf8f17f3e472eb5fb319e254a554a3/cloudml_prepare_local_csv.py#L8
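Whether NaNs can reappear after the join depends on the join type, which this thread does not show. The sketch below compares an inner and a left merge in pandas. The `ticker` key, the frame names, and the values are hypothetical and not taken from the repository.

```python
import numpy as np
import pandas as pd

# Financial side: nulls in prc_change_t2 are dropped up front.
financial = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "prc_change_t2": [0.03, np.nan, -0.01],
}).dropna(subset=["prc_change_t2"])

# Text side: includes keys with no surviving financial row.
filings = pd.DataFrame({
    "ticker": ["AAA", "BBB", "DDD"],
    "text": ["filing a", "filing b", "filing d"],
})

# Inner join keeps only keys present on both sides.
inner = filings.merge(financial, on="ticker", how="inner")

# Left join keeps every filing and fills unmatched rows with NaN.
left = filings.merge(financial, on="ticker", how="left")

inner_missing = int(inner["prc_change_t2"].isna().sum())
left_missing = int(left["prc_change_t2"].isna().sum())
print(f"inner: {inner_missing} missing, left: {left_missing} missing")
```

With an inner join nothing is left to drop. With a left join from the text side, unmatched filings come back with NaN in `prc_change_t2`, which would account for a large row loss on a later dropna.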