ak--47 / toMixpanel

🚰 ETL data into Mixpanel from many sources

JSON parse error due to "let" scope #2

Closed alanflickgames closed 2 years ago

alanflickgames commented 2 years ago

When attempting to migrate data from Amplitude to Mixpanel following the YouTube tutorial for this tool, I encountered the following error:

LOAD!
   events:

       failed to parse data... only valid JSON or NDJSON is supported by this script
ReferenceError: file is not defined
    at main (file:///Users/me/tomixpanel/load/sendEventsToMixpanel.js:69:17)
    at async amplitudeETL (file:///Users/me/tomixpanel/connectors/amplitudeETL.js:46:30)
    at async main (file:///Users/me/tomixpanel/index.js:69:13)

This seems to be because file is declared with let inside a block on line 37 of sendEventsToMixpanel.js, so it's out of scope where it's used later in the function.

Adding let file = ""; to the start of the function and removing the keyword let on line 37 seems to fix the issue.
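For reference, the shape of the bug is roughly this (a minimal illustration with made-up names, not the actual toMixpanel code):

import fs from "fs";

// broken shape: `file` is declared with let inside the block,
// so it no longer exists by the time it's parsed
function loadBroken(path) {
    if (path.endsWith(".json")) {
        let file = fs.readFileSync(path, "utf-8");
    }
    return JSON.parse(file); // ReferenceError: file is not defined
}

// fixed shape: declare `file` once at function scope and only assign inside the block
function loadFixed(path) {
    let file = "";
    if (path.endsWith(".json")) {
        file = fs.readFileSync(path, "utf-8");
    }
    return JSON.parse(file);
}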

ak--47 commented 2 years ago

@alanflickgames ack! i'm so sorry i didn't see this until now.

the whole 'send events to mixpanel' implementation has been overhauled to use streams... this is both faster and more reliable: https://github.com/ak--47/toMixpanel/blob/main/connectors/amplitudeETL.js#L62-L64

can you git pull to get the latest code and let me know if you're still seeing the issue?

thanks for writing in!

alanflickgames commented 2 years ago

No worries - as I said, I had it working by making the change suggested above. I think the version I was using was already the streams implementation (I only started using the tool about 3 weeks ago), but I've just pulled the latest, run a test, and everything looks good.

Incidentally, I reported another issue to Mixpanel support. When importing a single day of data (which I was doing when initially testing), the script would only pull in a single zip file for the day, whereas Amplitude packages the data so that there is at least one zip per hour.
The issue was that start_date and end_date expect an hour component to be given, like so:

"start_date": "2019-12-28T0",
"end_date": "2019-12-28T23"

However, the docs simply say to supply the dates:

 "start_date": "2019-12-28",
 "end_date": "2019-12-28"

I'm not sure if multi-date transfers behave the same way yet, but it might be worth adding a default T0 to the start_date and T23 to the end date if no time value is provided.

ak--47 commented 2 years ago

@alanflickgames great to hear from you!

yes, i tried to keep the config as simple as possible by using plain YYYY-MM-DD dates in the examples.

under the hood, i already convert start_date and end_date to be inclusive, so they become startT00 => endT23 like you suggest: https://github.com/ak--47/toMixpanel/blob/main/connectors/amplitudeETL.js#L21-L26
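roughly speaking, the conversion does something like this (just a sketch to illustrate the idea; the linked code is the source of truth):

import dayjs from "dayjs";

// expand date-only config values into inclusive hour bounds;
// values that already include a THH component are left untouched
function makeInclusive(config) {
    const { start_date, end_date } = config;
    return {
        ...config,
        start_date: start_date.includes("T") ? start_date : dayjs(start_date).format("YYYY-MM-DD") + "T00",
        end_date: end_date.includes("T") ? end_date : dayjs(end_date).format("YYYY-MM-DD") + "T23",
    };
}

// makeInclusive({ start_date: "2019-12-28", end_date: "2019-12-28" })
// => { start_date: "2019-12-28T00", end_date: "2019-12-28T23" }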

however, it is perfectly fine to pass the THH values explicitly in your config, and the script will respect them:

//just ETL a single hour's worth of data
 "start_date": "2019-12-28T04",
 "end_date": "2019-12-28T05"

(one of the great benefits of the dayjs lib is that it handles all manner of date/time formats! so other date formats will probably work too!)

given your inquiry about multi-date transfers, it's worth mentioning that Amplitude's /export API has a hard 4GB limit on the amount of data it will return.

as a result, an ETL over a large time range (or with lots of data) can fail at the /export step.

to accommodate this, I've written a quick and dirty "replicator" which splits an existing config into many hourly-chunk configs between start_date and end_date: https://github.com/ak--47/toMixpanel/blob/main/ampReplicator.js

the output of node ampReplicator.js ./myConfig.json ./myOutputDir also includes a bash script and a parallel script which 'loop' over all the config files, so you can kick off the full job with a single command.

closing this issue for now (as i believe it's resolved) but please feel free to raise other issues/concerns/feature requests.

i mostly developed this tool for our internal teams to assist with data migrations; i open sourced it so anyone could take advantage. so glad to see you putting it to use! <3

alanflickgames commented 2 years ago

Thanks for the info about the 4GB limit. I've just tried out the replicator, but hit an error when running the resulting runAll.sh.
It seems to immediately rename all of the json files so they end in ".json.txt", which then leads to the following error for each of the files:

node:internal/process/promises:265
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: ENOENT: no such file or directory, open '/Users/me/mixpanelFromAmplitude/logs/ampmixconfig-3.json'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: '/Users/me/mixpanelFromAmplitude/logs/ampmixconfig-3.json'
}

Is there another step that I've missed?

Edit: installing parallel and using that method seemed to work; it's just the runAll.sh script that had the issue above.

alanflickgames commented 2 years ago

Sorry to keep expanding on this, but I think I've spotted a few more errors in the replicator.

These lines work out the number of configs to split into, by taking the number of days and multiplying by the number of hours in a day:

let fullDateDelta = end.diff(start, 'd') * 24;
let numOfConfigsNeeded = Math.ceil(fullDateDelta)

However, each file that is created actually fetches the data for a 2-hour period, because the hour component is inclusive. For example, "start_date": "2020-03-01T02", "end_date": "2020-03-01T03" will fetch all data from 02:00 until 03:59.

The next file will be T04 to T05 etc.

This means that numOfConfigsNeeded ends up defining twice as many files as needed, which effectively doubles the number of days covered by the generated configs.

My initial test was for "start_date": "2020-01-01", "end_date": "2020-01-31", but I ended up with data for both Jan and Feb being exported/imported without realising why.

When I tested the replicator again using "start_date": "2020-03-01", "end_date": "2020-03-31", the output json files ended up covering the period from 2020-03-01T00 to 2020-04-30T00.

To get the correct number of configs, I think it just needs these changes:

  1. The fullDateDelta doesn't account for the fact that ALL days in the range need to be counted. The difference between 31 and 1 is 30, but you would expect 31 days of data for 2020-03-01 to 2020-03-31, so 1 day needs to be added: let fullDateDelta = (end.diff(start, 'd') + 1) * 24;

  2. Halve the number of configs needed, to account for the fact that each one covers 2 hours: let numOfConfigsNeeded = Math.ceil(fullDateDelta / 2); (or just multiply by 12 instead of 24 in the first step)

After making those changes I seem to get the correct values in the output json files.
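For what it's worth, putting both changes together looks roughly like this (a self-contained sketch using dayjs and the variable names from ampReplicator.js; the March range is just an example):

import dayjs from "dayjs";

const config = { start_date: "2020-03-01", end_date: "2020-03-31" }; // example range
const start = dayjs(config.start_date);
const end = dayjs(config.end_date);

// count every day in the range inclusively (31 - 1 = 30, but the range spans 31 days),
// then convert days to hours
let fullDateDelta = (end.diff(start, 'd') + 1) * 24;

// each generated config covers a 2-hour window (the THH bounds are inclusive),
// so only half as many configs are needed
let numOfConfigsNeeded = Math.ceil(fullDateDelta / 2);

console.log(numOfConfigsNeeded); // 372 configs for March 2020 (31 days * 12)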

It will also spill over into the first hour of the day after "end_date" if there is a daylight savings change in the period specified, but I'm not sure how best to deal with that.