almartin82 / mapvizieR

visualizations and reports for the NWEA MAP assessment in R

mapvizieR function parsing some dates incorrectly #321

Open chrishaid opened 7 years ago

chrishaid commented 7 years ago

I've got dates such as 01/12/2017 (January 12th, 2017) getting parsed as 2017-12-01 (Dec 1st, 2017). But most dates are fine. Our dates come in as text fields and are unadulterated from how NWEA puts them in the CDF.

This bit of code (i.e., the munge_startdate() function) is the culprit. From the docs:

When several format-orders are specified parse_date_time sorts the supplied format-orders based on a training set and then applies them recursively on the input vector.

I get the sense it's guessing wrong on those ambiguous dates. Like, what is this so-called "training set"?
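To make the ambiguity concrete, here is a minimal sketch in base R (not the lubridate call used in mapvizieR; the two format strings are assumptions for illustration). A string like 01/12/2017 is a valid date under both orderings, so nothing in the string itself says which is right, while 01/13/2017 can only be month-day-year.

```r
# Base R only; the two format strings are illustrative assumptions.
x <- c("01/12/2017", "01/13/2017")

as_mdy <- as.Date(x, format = "%m/%d/%Y")  # month/day/year reading
as_dmy <- as.Date(x, format = "%d/%m/%Y")  # day/month/year reading

# TRUE where a string is a valid date under both orders, i.e. ambiguous
ambiguous <- !is.na(as_mdy) & !is.na(as_dmy)

print(data.frame(x, as_mdy, as_dmy, ambiguous))
```

Only the first string is flagged as ambiguous; the second fails under day/month/year because there is no month 13.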

Possible remedy: we sample the teststartdate column, infer the format for each sampled date, then take the most common format in the sample and explicitly set that date order (ymd, mdy, etc.) for the parser?

Thoughts @almartin82?

almartin82 commented 7 years ago

Does MAP always do this consistently? Or (ugh) does it change over cdfs?

On May 17, 2017 1:04 PM, "Chris Haid" wrote the comment above, linking the code in question: https://github.com/almartin82/mapvizieR/blob/551c01fc7c9ac10ff4fdcb1c987dbd9484e0bfd6/R/util.R#L441-L445

chrishaid commented 7 years ago

It's consistent across CDFs. My guess is that if you stored it as a date in your DB, then you get ymd back when you pull the data down. Which makes sense.

Literally the problem was with two ambiguous dates out of hundreds.

So in my case, say you sample 20 dates. All 20 will be mdy, but only 19 might be parsed that way, with one ambiguous date (12/01/2017) getting parsed as dmy. A 19-1 vote of the sample for mdy would then set the parser to explicitly use the mdy ordering for every instance.

Does that make sense?
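A rough sketch of the voting idea, in base R so it runs without extra packages. Everything here is an assumption for illustration: `infer_date_format()` is a hypothetical helper, not part of mapvizieR, and the three candidate formats are guesses at what a CDF or a database export might contain. It is also a slight variation on the proposal: rather than guessing one format per date, each sampled date votes for every candidate format it parses under, so an ambiguous date votes for both mdy and dmy and the unambiguous dates decide the outcome.

```r
# Hypothetical helper (not in mapvizieR): pick one date format by majority vote.
infer_date_format <- function(x, n = 20,
                              candidates = c(mdy = "%m/%d/%Y",
                                             dmy = "%d/%m/%Y",
                                             ymd = "%Y-%m-%d")) {
  x <- x[!is.na(x)]
  # Sample up to n date strings from the column
  smp <- x[sample.int(length(x), min(n, length(x)))]
  # For each candidate format, count how many sampled strings parse as a date
  votes <- vapply(candidates,
                  function(fmt) sum(!is.na(as.Date(smp, format = fmt))),
                  integer(1))
  # which.max() breaks ties in favor of the earliest candidate listed
  candidates[[which.max(votes)]]
}

# Made-up example values in the month/day/year layout described in the thread
dates <- c("01/12/2017", "01/13/2017", "02/28/2017", "12/01/2017", "09/15/2016")

fmt <- infer_date_format(dates)
# Apply the single winning format to every value, ambiguous or not
parsed <- as.Date(dates, format = fmt)
```

If every sampled date happens to be ambiguous, the vote ties and the first candidate wins, so the candidate order acts as the fallback preference. With lubridate, the analogous step would presumably be passing the single winning order to `parse_date_time()` instead of several.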