CaoZ / Fast-LianJia-Crawler

A blazing-fast crawler that fetches data directly through the Lianjia API, the fastest in the universe~~ 🚀
286 stars 99 forks

Duplicate communities across business circles: the unique id on the communities table causes inserts to fail #5

Open wzrzt opened 5 years ago

wzrzt commented 5 years ago

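Different business circles can contain the same community, but ids in the communities table must be unique, so the run failed as soon as it crawled the second business circle. This needs some handling. Following the existing approach of deleting rows by business circle id, I added a statement that deletes any existing row with the same community id just before each community is inserted into the database. See below: the update_db function at line 163 of Main.py.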

def update_db(db_session, biz_circle, communities):
    """
    Update community information and business circle information.
    """
    db_session.query(Community).filter(
        Community.biz_circle_id == biz_circle.id
    ).delete()

    for community_info in communities['list']:
        try:
            district_id = DISTRICT_MAP[community_info['district_name']]
            community = Community(biz_circle.city_id, district_id, biz_circle.id, community_info)

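            # Added: delete any existing row with this id, since another
            # business circle may already have inserted the same community.
            db_session.query(Community).filter(
                Community.id == community.id
            ).delete()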

            db_session.add(community)
        except Exception as e:
            # The returned data may be wrong or incomplete, e.g. a community that is no longer valid returns incomplete information
            # e.g. http://sz.lianjia.com/xiaoqu/2414168277659446
            logging.error('Error: community id: {}; message: {}'.format(community_info['community_id'], repr(e)))

    biz_circle.communities_count = communities['count']
    biz_circle.communities_updated_at = datetime.now()

    db_session.commit()
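
The failure and the fix above can be reproduced in isolation. The following is a minimal sketch using Python's built-in sqlite3 module rather than the project's own models. The table layout, the `insert_communities` and `make_db` helpers, and the sample data are all hypothetical and only mirror the shape of update_db. It assumes the communities table has a unique id and that one community can be listed under two business circles.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a unique community id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE communities ("
        "id INTEGER PRIMARY KEY, biz_circle_id INTEGER, name TEXT)"
    )
    return conn


def insert_communities(conn, biz_circle_id, communities, delete_existing_id):
    """Replace one business circle's communities, mirroring the shape of update_db."""
    # Same first step as update_db: clear rows already tied to this business circle.
    conn.execute("DELETE FROM communities WHERE biz_circle_id = ?", (biz_circle_id,))
    for community_id, name in communities:
        if delete_existing_id:
            # The added step: drop any row with the same id, even if it was
            # inserted while crawling a different business circle.
            conn.execute("DELETE FROM communities WHERE id = ?", (community_id,))
        conn.execute(
            "INSERT INTO communities (id, biz_circle_id, name) VALUES (?, ?, ?)",
            (community_id, biz_circle_id, name),
        )
    conn.commit()


# Hypothetical data: community 100 is listed under both business circles.
circle_a = [(100, "Shared Community"), (101, "Only In A")]
circle_b = [(100, "Shared Community"), (102, "Only In B")]

# Without the extra delete, the second business circle hits the unique id.
conn = make_db()
insert_communities(conn, 1, circle_a, delete_existing_id=False)
try:
    insert_communities(conn, 2, circle_b, delete_existing_id=False)
    failed = False
except sqlite3.IntegrityError:
    conn.rollback()
    failed = True

# With the extra delete, both business circles load.
conn_fixed = make_db()
insert_communities(conn_fixed, 1, circle_a, delete_existing_id=True)
insert_communities(conn_fixed, 2, circle_b, delete_existing_id=True)
rows = conn_fixed.execute(
    "SELECT id, biz_circle_id FROM communities ORDER BY id"
).fetchall()

print(failed)  # True
print(rows)    # [(100, 2), (101, 1), (102, 2)]
```

One consequence of this approach is visible in the output: the shared community ends up attributed to whichever business circle was crawled last, so the first business circle no longer has a row for it.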