IamTheFij / email-assistant

Mirror of https://git.iamthefij.com/iamthefij/email-assistant

Updating status of parcel tracking #1

Phyks opened 6 years ago

Phyks commented 6 years ago

Hi,

It seems from the web UI (viewer) that you planned at some point to automatically fetch the latest status for all shipping numbers found in the emails.

Not sure if it might help, but I know about Weboob (Web Outside of Browsers), a collection of Python modules for fetching (scraping) data from websites. For instance, they already have a module for UPS (https://git.weboob.org/weboob/devel/tree/master/modules/ups) and DHL (https://git.weboob.org/weboob/devel/tree/master/modules/dhl).
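
For illustration, here is a minimal sketch of what querying a parcel through Weboob's Python API could look like. It assumes Weboob is installed with a parcel-capable backend (e.g. ups or dhl) configured; the capability and method names reflect my reading of Weboob's CapParcel API and should be checked against the installed version, and the tracking number is made up:

```python
# Minimal sketch: ask every configured parcel-capable Weboob backend
# for the status of a tracking number. Assumes weboob is installed and
# backends are configured; names per Weboob's CapParcel documentation.
from weboob.core import Weboob
from weboob.capabilities.parcel import CapParcel

woob = Weboob()
woob.load_backends(caps=CapParcel)  # only backends that can track parcels

tracking_number = "1Z999AA10123456784"  # made-up UPS-style number
# do() fans the call out across the loaded backends and yields the results.
for parcel in woob.do('get_parcel_tracking', tracking_number):
    print(parcel.id, parcel.status)
```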

I might actually try to wire the two of them together in my setup at some point; let me know if you are interested :)

Phyks commented 6 years ago

Oops, I just found out you are actually already using Trackerific for this. But it requires credentials (and I had not configured any ^^), so the above might still be a good option when the tracking URL is public. :)

IamTheFij commented 6 years ago

Not sure why I'm not getting notifications when these come up, but I just saw this.

Trackerific is doing a good job so far, but it would be cool to have some way to do this without requiring credentials. There are also some areas where I'm having issues, such as Amazon shipping emails. Weboob could help with that.

Also on my list of things to add is support for flight tracking. There aren't great APIs for that, and definitely no free ones that I could find. Another possibly good use case.

Thanks for sharing!

Phyks commented 6 years ago

> Also on my list of things to add is support for flight tracking.

I do support it through microdata email markup in my fork. Sadly, this markup seemed to be widely used a few years ago, but it is no longer common (at least among the European airlines I travel with). :/
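
For context, this markup is schema.org data embedded in the email body, either as microdata attributes on the HTML or as a JSON-LD script tag (the format Gmail's email markup documentation uses). Here is a minimal extraction sketch for the JSON-LD variant, using BeautifulSoup (not part of either project) and an invented email body:

```python
# Sketch: extract a schema.org FlightReservation from an email's HTML body.
# The embedded JSON-LD below is an invented example of sender-side markup.
import json
from bs4 import BeautifulSoup

html_body = """
<html><body><p>Your booking is confirmed.</p>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "FlightReservation",
 "reservationFor": {"@type": "Flight", "flightNumber": "AF1234",
                    "departureAirport": {"iataCode": "CDG"}}}
</script>
</body></html>
"""

soup = BeautifulSoup(html_body, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    if data.get("@type") == "FlightReservation":
        flight = data["reservationFor"]
        print(flight["flightNumber"], flight["departureAirport"]["iataCode"])
```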

For now, I have mostly turned to adding extra crawlers that fetch from my online accounts directly, either using Weboob (crawler is here) or CozyCloud connectors (crawler is WIP). This has the extra advantage of reducing false positives (I got some with the tracking parsers). It might be worth noting that I'm indexing everything with schema.org schemas, for standardization across parsers and crawlers.

Feel free to take whatever is useful to you from my fork. We could even consider merging it back if you find it interesting.

> There aren't great APIs for that, and definitely no free ones that I could find.

Indeed, but even more generally, I did not find many great APIs (even paid ones) to handle emails and extract meaningful info from them (using machine learning techniques). Google's Inbox does it nicely, but there seem to be no reusable APIs for it. :/

IamTheFij commented 6 years ago

> I do support it through microdata email markup in my fork. Sadly, this markup seemed to be widely used a few years ago, but it is no longer common (at least among the European airlines I travel with). :/

Yea, I was trying to figure out how Google extracts data for Gmail, but I gave up.

> For now, I have mostly turned to adding extra crawlers that fetch from my online accounts directly

The idea of adding additional crawlers is cool! I hadn't even considered crawling account pages in addition to my IMAP inbox.

> I did not find many great APIs (even paid ones) to handle emails and extract meaningful info from them

What I did find was APIs for returning flight tracking info, but they were something like half a cent per request. I'd have to extract flight numbers from emails using a regex of sorts and then hit the API to see if they're valid.
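
To sketch that extract-then-verify flow (the pattern is a naive IATA-style approximation and check_flight is a hypothetical wrapper around such a paid API, not anything the project ships):

```python
# Sketch of extract-then-verify: regex out candidate flight numbers, then
# confirm each one against a (paid) tracking API before trusting it.
import re

# Naive pattern: 2-letter airline code plus 1-4 digits. It will overmatch.
FLIGHT_RE = re.compile(r"\b([A-Z]{2})\s?(\d{1,4})\b")

def candidate_flights(email_text):
    """Yield possible flight numbers found in free-form email text."""
    for airline, number in FLIGHT_RE.findall(email_text):
        yield airline + number

def verified_flights(email_text, check_flight):
    # check_flight(flight_no) -> bool is a hypothetical call into the
    # tracking API; at ~half a cent per request, each candidate costs money.
    return [f for f in candidate_flights(email_text) if check_flight(f)]

# Made-up usage: only candidates the API confirms are kept.
print(verified_flights("Flight AF 1234 departs CDG at 10:25",
                       check_flight=lambda f: f == "AF1234"))
```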

I'm definitely interested in contributions. If you've got parts you think may be useful to send back as patches, that'd be great. I can help with writing the Dockerfiles. One of the nice things about the microservice implementation is that functionality is trivial to enable or disable: if I don't want to run the Weboob crawler myself yet, I can just take it out of my compose file and everything else will still function. The same goes for new parsers and viewers.

If you would like to send PRs, it'd be great to split them up and contribute the pieces incrementally. That will make it easier for me to test and verify the merges. Since this is a small project with only a tiny community (just us two so far), I'd like to accommodate your changes and avoid a hard fork that would split our contributions.

Phyks commented 6 years ago

> Yea, I was trying to figure out how Google extracts data for Gmail, but I gave up.

It seems to me they started out parsing microdata from the emails, but they are no longer relying on this. I'm not sure whether they have a bunch of ad-hoc scrapers for the emails or are simply doing machine learning on them, similar to what https://developer.edison.tech/sift claims to do (but I gave that API a try and it gave terrible results).

> What I did find was APIs for returning flight tracking info, but they were something like half a cent per request. I'd have to extract flight numbers from emails using a regex of sorts and then hit the API to see if they're valid.

Getting the flight tracking info might not be a huge problem: for a personal use case, scraping data from flight tracking websites is probably more than enough, so this part (flight tracking / checking) could be handled fairly easily for free. The difficult part, in my opinion, is extracting the flight number and details from the email.

> If you would like to send PRs, it'd be great to split them up and contribute the pieces incrementally. That will make it easier for me to test and verify the merges. Since this is a small project with only a tiny community (just us two so far), I'd like to accommodate your changes and avoid a hard fork that would split our contributions.

Sure, I'll try to clean up and PR my system. Feel free to discard anything that isn't interesting to you. Is it fine if I open these PRs on GitHub, or would you prefer your own hosted repo?

I was also thinking about the organization of the repo and microservices. Maybe it would make sense to have a GitHub organization hosting all the microservices in dedicated repos, plus a main entry point with the instructions and the global Docker Compose file? That would make it easier to support extra community-driven microservices, a bit like Yunohost does with its official and community apps. It might be a bit overkill at the moment though :)