Closed alanflickgames closed 2 years ago
@alanflickgames ack! i'm so sorry i didn't see this until now.
the whole 'send events to mixpanel' implementation has been completely replaced and overhauled to use streams... this is both faster and more reliable. https://github.com/ak--47/toMixpanel/blob/main/connectors/amplitudeETL.js#L62-L64
can you git pull
to get the latest code and let me know if you're still seeing the issue?
thanks for writing in!
No worries, as I say I had it working by making the change suggested. I think the implementation I was using was already the version using streams (I only started using it approx 3 weeks ago), but I've just pulled the latest and ran a test and all looks good.
Incidentally, I reported another issue to Mixpanel support. When importing a single day of data (which I was doing when initially testing), it would only pull in a single zip file for the day - whereas Amplitude packages the data so that there is at least one zip per hour.
The issue was that the start and end time expects there to be an hour parameter given, like so:
"start_date": "2019-12-28T0",
"end_date": "2019-12-28T23"
However the docs simply specify to add the dates:
"start_date": "2019-12-28",
"end_date": "2019-12-28"
I'm not sure if multi-date transfers behave the same way yet, but it might be worth adding a default T0 to the start_date and T23 to the end date if no time value is provided.
@alanflickgames great to hear from you!
yes, i tried to keep the config as simple as possible using simple YYYY-MM-DD
date formats for the examples.
under the hood, i already convert start_date
and end_date
to be inclusive, so they become startT00
=> endT23
like you suggest:
https://github.com/ak--47/toMixpanel/blob/main/connectors/amplitudeETL.js#L21-L26
however, it is perfectly fine to pass the THH
values explicitly in your config, and the script will respect them:
//just ETL a single hour's worth of data
"start_date": "2019-12-28T04",
"end_date": "2019-12-28T05"
(one of the great benefits of the dayjs
lib is that is handles all manner of date times! so other date formats will probably work too!)
given your inquiry about multi-date transfers, it's worth mentioning that Amplitude's /export
API has a hard 4GB limit on the amount of data it will return.
As a result, attempting an ETL with a large time range (or lots of data) can fail to /export
to accommodate this, I've written a quick and dirty "replicator" which replicates an existing config many times into hourly chunks between start_date
and end_date
https://github.com/ak--47/toMixpanel/blob/main/ampReplicator.js
the output of node ampReplicator.js ./myConfig.json ./myOutputDir
also creates a bash script and a parallels script which is used to 'loop' over all the config files, so you can kickoff the full job with a single command:
closing this comment for now (as i believe it's resolved) but please feel free to raise other issues/concerns/feature requests.
i mostly developed this tool for our internal teams to assist with data migrations; i open sourced it so anyone could take advantage. so glad to see you putting it to use! <3
Thanks for the info about the 4GB limit. I've just tried out the replicator, but have hit an error when running the resulting runAll.sh.
It seems to immediately rename all of the json files so they end ".json.txt", which then leads to the following error for each of the files:
node:internal/process/promises:265
triggerUncaughtException(err, true /* fromPromise */);
^
[Error: ENOENT: no such file or directory, open '/Users/me/mixpanelFromAmplitude/logs/ampmixconfig-3.json'] {
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: '/Users/me/mixpanelFromAmplitude/logs/ampmixconfig-3.json'
}
Is there another step that I've missed?
Edit: installing parallel and using that method seemed to work, it's just the runAll.sh script that had the issue above.
Sorry to keep expanding on this, but I think I've spotted a few more errors in the replicator.
These lines work out the number of configs to split into, by taking the number of days and multiplying by the number of hours in a day:
let fullDateDelta = end.diff(start, 'd') * 24;
let numOfConfigsNeeded = Math.ceil(fullDateDelta)
However, each file that is created actually fetches the data for a 2 hour period, because the hour component is inclusive. For example "start_date": "2020-03-01T02, "end_date": "2020-03-01T03" will fetch all data from 02:00 until 03:59
The next file will be T04 to T05 etc.
This means that numOfConfigsNeeded ends up defining twice as many files as it needs, and so ends up doubling the number of days in the date range.
My initial test was for: "start_date": "2020-01-01, "end_date": "2020-01-31" But I ended up with data for both Jan and Feb being exported/imported without realising why.
When I've tested the replicator again using: "start_date": "2020-03-01, "end_date": "2020-03-31" The output json files end up covering a period from 2020-03-01T00 to 2020-04-30T00
To get the correct number of entries I think it just needs these changes:
The fullDateDelta doesn't account for the fact that you need to count ALL days in the range. The difference between 31 and 1 is 30, but you would expect to see 31 days of data if you ended 2020-03-01 to 2020-03-31 so 1 day needs to be added on:
let fullDateDelta = (end.diff(start, 'd')+1) * 24;
Halve the number of configs needed, to account for the fact that each one covers 2 hours
let numOfConfigsNeeded = Math.ceil(fullDateDelta) / 2
(or just multiply by 12 in the first step)
After making those changes I seem to get the correct values in the output json files.
It will also spill over into the first hour of the day after "end_date" if there is a daylight savings change in the period specified, but I'm not sure how best to deal with that.
When attempting to migrate data from Amplitude to Mixpanel following the Youtube tutorial for this tool, I encountered the following error:
This seems to be because file is locally scoped on line 37 of sendEventsToMixpanel.js
Adding
let file = "";
to the start of the function and removing the keywordlet
on line 37 seems to fix the issue.