Open FlorinAndrei opened 4 years ago
Also for the text column. I did dropna() on that one too, and the final table now has about 18k rows.
Finally, should the filename column even be kept in the CSV? It's not used by ML at all, is it?
The CSV is big enough as it is, might as well trim down the stuff that's not used.
Yes, you're right about dropping the filename column. I added that here: https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/commit/c5b145852bbf8f17f3e472eb5fb319e254a554a3
As for another dropna, there should not be anything to drop there if you're joining it with the financial dataframe, where that column has already had the nulls dropped: https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/c5b145852bbf8f17f3e472eb5fb319e254a554a3/cloudml_prepare_local_csv.py#L8
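One way to check this is to count the nulls per column right after the join. Below is a minimal sketch, not the repo's actual code: the two toy frames, the ticker key, and the use of merge are assumptions for illustration. Only the prc_change_t2 and text column names come from this thread. It shows that an inner join keeps the label column free of NaN, while a left join can bring NaN back for rows with no match, and that nulls in the text column are unaffected either way.

```python
import numpy as np
import pandas as pd

# Hypothetical financial frame; the label column has its nulls dropped up front.
financial = pd.DataFrame({
    "ticker": ["A", "B", "C"],            # "ticker" is an assumed join key
    "prc_change_t2": [0.1, np.nan, -0.2],
}).dropna(subset=["prc_change_t2"])

# Hypothetical text frame with one missing text and one unmatched ticker.
texts = pd.DataFrame({
    "ticker": ["A", "B", "C", "D"],
    "text": ["x", "y", None, "z"],
})

# An inner join keeps only tickers present in both frames.
inner = texts.merge(financial, on="ticker", how="inner")
# A left join keeps every text row and fills unmatched labels with NaN.
left = texts.merge(financial, on="ticker", how="left")

inner_nulls = inner.isna().sum()
left_nulls = left.isna().sum()

print(inner_nulls)
print(left_nulls)
```

If the per-column counts after the real join are all zero, the extra dropna is indeed a no-op. If not, the join type or the text column is the likely source.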
https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/4ce8e1e829ed6f6ecd74f8abbfaf91114af1201b/cloudml_prepare_local_csv.py#L31
Should there be a df.dropna(subset=['prc_change_t2']) here? I collected the data for the last 10 years. I get 91k rows at that point. But if I run .dropna(subset=['prc_change_t2']), only about 20k rows remain. I think the NaN rows should not even be sent to ML.
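For reference, here is a small sketch of what that call does. The data is a made-up toy frame, not the real dataset; only the prc_change_t2 and text column names are taken from this thread. With subset set, pandas drops a row only when the listed column is NaN, so a null in another column (such as text) does not by itself remove the row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared dataframe; values are invented for illustration.
df = pd.DataFrame({
    "text": ["a", "b", None, "d"],
    "prc_change_t2": [0.01, np.nan, 0.03, np.nan],
})

rows_before = len(df)

# Drop only the rows whose label column is NaN; other columns are ignored.
df = df.dropna(subset=["prc_change_t2"])

rows_after = len(df)
print(f"{rows_before} rows before, {rows_after} rows after")
```

Printing the row count before and after, as above, is a cheap way to see how much of the data has no label.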