!!!Some details
chengjiao_spider.py
Caps the total page count at 30, since crawling too many pages would take too long.

getSingleArea.py
Can be configured to crawl one specific area of one district in one city, toggled via `if district == 'wuzhong'`.

Price columns above a threshold are divided by 10,000 (converting values recorded in yuan to 万元):
combined_df['成交价'] = combined_df['成交价'].apply(lambda x: x / 10000 if x > 1000000 else x)    # 成交价: transaction price
combined_df['挂牌价'] = combined_df['挂牌价'].apply(lambda x: x / 10000 if x > 1000000 else x)    # 挂牌价: listing price
combined_df['成交单价'] = combined_df['成交单价'].apply(lambda x: x / 10000 if x > 500000 else x)   # 成交单价: transaction unit price
combined_df['挂牌单价'] = combined_df['挂牌单价'].apply(lambda x: x / 10000 if x > 1000000 else x)  # 挂牌单价: listing unit price
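A quick sanity check of the normalization above, on toy data rather than real crawl output: a 成交价 recorded in yuan (3,500,000) becomes 350 万元, while a value already in 万元 (350) passes through unchanged.

```python
import pandas as pd

# Toy frame with one yuan-scale value and one already in 万元.
combined_df = pd.DataFrame({'成交价': [3500000.0, 350.0]})
combined_df['成交价'] = combined_df['成交价'].apply(lambda x: x / 10000 if x > 1000000 else x)
print(combined_df['成交价'].tolist())  # [350.0, 350.0]
```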
City code abbreviations:
hz: Hangzhou, sz: Shenzhen, dl: Dalian, fs: Foshan
xm: Xiamen, dg: Dongguan, gz: Guangzhou, bj: Beijing
cd: Chengdu, sy: Shenyang, jn: Jinan, sh: Shanghai
tj: Tianjin, qd: Qingdao, cs: Changsha, su: Suzhou
cq: Chongqing, wh: Wuhan, hf: Hefei, yt: Yantai
nj: Nanjing
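The abbreviation table above can be kept as a dict, with a small URL builder on top. Note the `lianjia.com` subdomain pattern here is an assumption about the target site, not taken from this project's code.

```python
# City abbreviations from the table above.
CITY_CODES = {
    'hz': 'Hangzhou',  'sz': 'Shenzhen', 'dl': 'Dalian',    'fs': 'Foshan',
    'xm': 'Xiamen',    'dg': 'Dongguan', 'gz': 'Guangzhou', 'bj': 'Beijing',
    'cd': 'Chengdu',   'sy': 'Shenyang', 'jn': 'Jinan',     'sh': 'Shanghai',
    'tj': 'Tianjin',   'qd': 'Qingdao',  'cs': 'Changsha',  'su': 'Suzhou',
    'cq': 'Chongqing', 'wh': 'Wuhan',    'hf': 'Hefei',     'yt': 'Yantai',
    'nj': 'Nanjing',
}

def chengjiao_url(code: str, page: int = 1) -> str:
    """Build a paginated 成交 (sold listings) URL for one city code (assumed pattern)."""
    return f'https://{code}.lianjia.com/chengjiao/pg{page}/'

print(chengjiao_url('sz', 3))  # https://sz.lianjia.com/chengjiao/pg3/
```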
Sample run logs:

Total crawl 207 areas.
Total cost 294.048109055 second to crawl 27256 data items.

Total crawl 215 areas.
Total cost 1028.3090899 second to crawl 75448 data items.

Total crawl 215 areas.
Total cost 299.7534770965576 second to crawl 32735 data items.

Total crawl 400 loupan.
Total cost 29.757128953933716 second
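For a rough sense of throughput, the per-item runs above work out to roughly 70–110 items per second, computed directly from the logged totals:

```python
# (items crawled, seconds elapsed) pairs taken from the run logs above.
runs = [
    (27256, 294.048109055),
    (75448, 1028.3090899),
    (32735, 299.7534770965576),
]
for items, seconds in runs:
    print(f'{items / seconds:.1f} items/s')  # 92.7, 73.4, 109.2
```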