DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License
86 stars 38 forks

Update to use Python 3.5 #48

Open janetriley opened 8 years ago

janetriley commented 8 years ago

Update the project to be compatible with Python 3.5 so we have the option to use asyncio.

bbengfort commented 8 years ago
janetriley commented 8 years ago

In addition there are references to 2.7 in

will2041 commented 8 years ago

Pull request that includes code changes and Travis update.

will2041 commented 8 years ago

Have another pull request out that takes care of the documentation updates and removes an unused method. Only work after that will be any updates needed on the docker front.

bbengfort commented 8 years ago

@will2041 is this complete now? I'm using Python 3.5 and everything seems to be fine.

will2041 commented 8 years ago

Basically. The Dockerfile still uses 2.7 and has some references to it, but that's it I think. I did run into some weird behavior with the export command using bin/baleen, but I'm not sure it's a 3.5 problem.

This could probably be closed. I think there's a separate item for Docker updates.

bbengfort commented 8 years ago

@will2041 -- ok this can be closed; I'm just hesitant to actually push to production, especially since things have been running so well! We may have to find a time where we're both available to try to do the release together and push to production - any thoughts when?

will2041 commented 8 years ago

I suppose a weekend is easiest to coordinate schedules. I'm free Saturday, but after that I've got visitors/am travelling until after Labor Day.

bbengfort commented 8 years ago

So uh, I guess you meant this Saturday? I guess it'll have to keep until after Labor Day then! Sorry about that. Want to get something on the calendar?

will2041 commented 8 years ago

Yeah, let's schedule something. I sent you an invite for the 10th. Maybe if Labor Day weekend ends up being free we could move it up, but I probably won't know that until the last minute.

bbengfort commented 8 years ago

So the 10th I'm teaching -- though I could do it later in the evening; and like I said, I'll be driving back from North Carolina on the 17th; so 24th? Labor Day weekend could work.

Ben


will2041 commented 8 years ago

Ha! Oh man, now we're just pushing it out crazy far. Sundays work? 11th? My parents are in town that weekend of the 24th... How about some evening during the week? I could maybe swing that.

bbengfort commented 8 years ago

This is my life - scheduling months in advance; seriously ...

Sundays do work but in the evenings for me, not the morning. Want to do the 11th anytime after 2pm EST?

will2041 commented 8 years ago

Sundays are wonderful. I updated the invite to 3PM EST on the 11th.

bbengfort commented 8 years ago

Perfect, we figured it out!

will2041 commented 7 years ago

Updates before push:

Master branch change: https://github.com/bbengfort/baleen/blob/master/baleen/exceptions.py#L69 defines TimeoutError, which is already a built-in exception in Python 3 (a subclass of OSError).
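
For context on the clash: since Python 3.3, TimeoutError is a built-in exception derived from OSError, so a module-level class with the same name hides the built-in inside that module. Here is a minimal sketch of one way around it, assuming a hypothetical base class BaleenError and a renamed IngestionTimeout. Neither name is taken from the linked file, and the actual fix may look different.

```python
import builtins


class BaleenError(Exception):
    """Hypothetical project-level base exception (name assumed)."""


class IngestionTimeout(BaleenError):
    """Hypothetical replacement name that does not collide with the built-in."""


# The built-in TimeoutError exists on Python 3.3+ and derives from OSError.
assert issubclass(builtins.TimeoutError, OSError)

# The renamed class stays outside the built-in OSError hierarchy.
assert not issubclass(IngestionTimeout, OSError)


def fetch_with_deadline(elapsed, limit):
    """Raise the project-specific timeout when a (pretend) fetch ran too long."""
    if elapsed > limit:
        raise IngestionTimeout(
            "fetch took {:.1f}s, limit {:.1f}s".format(elapsed, limit)
        )
    return "ok"
```

With a distinct name, an `except IngestionTimeout` clause cannot accidentally swallow socket or OS-level timeouts, and `except TimeoutError` keeps its built-in meaning everywhere.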

will2041 commented 7 years ago

Example of current error when running ingestion locally:

baleen.ingest INFO [11/Sep/2016:12:41:27 -0700] -- MongoIngestor job baf0c464-7857-11e6-89aa-60f81dac6496 started
baleen.ingest ERROR [11/Sep/2016:12:41:38 -0700] -- Post Error for feed Washington Post: Breaking News, World, US, DC News & Analysis on entry 4: Tried to save duplicate unique keys (E11000 duplicate key error collection: baleen.posts index: url_1 dup key: { : "https://www.washingtonpost.com/politics/clinton-holds-lead-over-trump-in-new-poll-but-warning-signs-emerge/2016/09/10/800dee0c-76c8-11e6-b786-19d0cb1e..." })
baleen.ingest ERROR [11/Sep/2016:12:41:38 -0700] -- Post Error for feed Washington Post: Breaking News, World, US, DC News & Analysis on entry 6: 'NoneType' object has no attribute 'encode'
<<SKIPPED 59 more entries like above and below lines>>
baleen.ingest ERROR [11/Sep/2016:12:41:41 -0700] -- Post Error for feed Washington Post: Breaking News, World, US, DC News & Analysis on entry 78: 'NoneType' object has no attribute 'encode'
baleen.ingest ERROR [11/Sep/2016:12:41:57 -0700] -- Ingestion Error: 'PostWrangler' object has no attribute 'title'
baleen.ingest CRITICAL [11/Sep/2016:12:41:57 -0700] -- MongoIngestor job baf0c464-7857-11e6-89aa-60f81dac6496 failed!

will2041 commented 7 years ago

Well, I got it mostly working. Only weird thing is that the output has some errors:

Processed 35 (1 unchanged) feeds (5 minutes 25 seconds): 659 posts with 62 errors

52 errors are:

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url

But then there are 10 like this:

baleen.ingest ERROR [11/Sep/2016:16:43:44 -0700] -- Post Error for feed Washington Post: Breaking News, World, US, DC News & Analysis on entry 70: Post {'wp_uuid': 'd0905852-6eb7-11e5-b31c-d80d62b53e28', 'title_detail': {'value': 'This weekend’s open houses in D.C., Maryland, Virginia', 'base': 'http://feeds.washingtonpost.com/rss/homepage', 'language': None, 'type': 'text/plain'}, 'content': None, 'pubdate': None, 'url': 'https://www.washingtonpost.com/realestate/this-weekends-open-houses-in-dc-maryland-virginia/2015/10/09/d0905852-6eb7-11e5-b31c-d80d62b53e28_story.html', 'tags': [], 'title': 'This weekend’s open houses in D.C., Maryland, Virginia', 'links': [{'href': 'https://www.washingtonpost.com/realestate/this-weekends-open-houses-in-dc-maryland-virginia/2015/10/09/d0905852-6eb7-11e5-b31c-d80d62b53e28_story.html', 'rel': 'alternate', 'type': 'text/html'}], 'guidislink': False} does not contain any content

The logging is new. We were failing on saving to Mongo because the content field was None and we couldn't encode that to get a unique hash. Now we don't fail, but these contentless posts just disappear (using what I have in my workspace).
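
To illustrate the failure mode described above, here is a rough sketch of a None-safe content hash. The helper names content_signature and split_posts and the choice of SHA-256 are assumptions for illustration, not Baleen's actual code. The idea is to avoid calling .encode() on None and to collect contentless posts explicitly instead of letting them vanish.

```python
import hashlib


def content_signature(content):
    """Return a SHA-256 hex digest of the post content, or None if missing.

    Hypothetical helper: the name and the hash algorithm are assumptions.
    """
    if content is None:
        # Calling None.encode(...) is what raises
        # "'NoneType' object has no attribute 'encode'".
        return None
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def split_posts(posts):
    """Separate hashable posts from contentless ones so neither is dropped."""
    hashable, contentless = [], []
    for post in posts:
        if content_signature(post.get("content")) is None:
            contentless.append(post)
        else:
            hashable.append(post)
    return hashable, contentless
```

Keeping the contentless posts in their own list leaves room to decide later whether to log them, retry the fetch, or store them without a hash.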

will2041 commented 7 years ago

Updating status months later:

Next steps:

will2041 commented 7 years ago

Tried everything out with Python 3.6 locally and it all seems to work. I'm going to switch gears and look into deployment to see if I can get all this running (and repeatable/documented).