ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory #9

Closed: rossmounce closed this issue 9 years ago

rossmounce commented 10 years ago

I tried scraping 90 URLs from Frontiers, but quickscrape core dumped after ~69 of them! I'm not sure how reproducible this bug is, but I'll file it anyway...

quickscrape --urllist 90.txt --scraper journal-scrapers/generic_open.json

got stuck at: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full

I can see the 'rendered.html' & the 'results.json' files but no 'full' or 'pdf' so I guess it choked somehow when attempting to get those?

tail of terminal:

info:    waiting 20 seconds before next scrape
info:    processing URL: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
data:    fulltext_pdf: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/pdf
data:    fulltext_html: http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
data:    title: Bat guilds, a concept to classify the highly diverse foraging and echolocation behaviors of microchiropteran bats
data:    author: Denzinger, Annette
data:    author: Schnitzler, Hans-Ulrich
data:    date: 2013
data:    doi: 10.3389/fphys.2013.00164
data:    volume: 4
data:    description: Throughout evolution the foraging and echolocation behaviors as well as the motor systems of bats have been adapted to the tasks they have to perform while searching and acquiring food. When bats exploit the same class of environmental resources in a similar way, they perform comparable tasks and thus share similar adaptations independent of their phylogeny. Species with similar adaptations are assigned to guilds or functional groups. Habitat type and foraging mode mainly determine the foraging tasks and thus the adaptations of bats. Therefore we use habitat type and foraging mode to define seven guilds. The habitat types open, edge and narrow space are defined according to the bats’ echolocation behavior in relation to the distance between bat and background or food item and background. Bats foraging in the aerial, trawling, flutter detecting, or active gleaning mode use only echolocation to acquire their food. When foraging in the passive gleaning mode bats do not use echolocation but rely on sensory cues from the food item to find it. Bat communities often comprise large numbers of species with a high diversity in foraging areas, foraging modes, and diets. The assignment of species living under similar constraints into guilds identifies pattern of community structure and helps to understand the factors that underlie the organization of highly diverse bat communities. Bat species from different guilds do not compete for food as they differ in their foraging behavior and in the environmental resources they use. However, sympatric living species belonging to the same guild often exploit the same class of resources. To avoid competition they should differ in their niche dimensions. The fine grain structure of bat communities below the rather coarse classification into guilds is determined by mechanisms that result in niche partitioning.
info:    waiting for 2 downloads to complete in background
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)

The 90 URLs are here (order preserved, articles that contain the word 'phylogeny'):

http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00057/full
http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00098/full
http://journal.frontiersin.org/Journal/10.3389/fcimb.2012.00133/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2012.00131/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2012.00173/full
http://journal.frontiersin.org/Journal/10.3389/fendo.2014.00072/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2013.00001/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00011/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00012/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00016/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00026/full
http://journal.frontiersin.org/Journal/10.3389/fevo.2014.00027/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00069/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2011.00072/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2012.00301/full
http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00004/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2012.00024/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2012.00136/full
http://journal.frontiersin.org/Journal/10.3389/fimmu.2013.00122/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00063/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00090/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2011.00116/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00132/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00168/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00213/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00266/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00278/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00305/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00405/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2012.00444/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00084/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00095/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00151/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00190/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00192/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00217/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00291/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00322/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00330/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00366/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00381/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00413/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2013.00414/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00013/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00037/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00076/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00112/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00173/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00223/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00256/full
http://journal.frontiersin.org/Journal/10.3389/fmicb.2014.00298/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2011.00007/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2012.00017/full
http://journal.frontiersin.org/Journal/10.3389/fnana.2012.00050/full
http://journal.frontiersin.org/Journal/10.3389/fncel.2013.00247/full
http://journal.frontiersin.org/Journal/10.3389/fncir.2013.00178/full
http://journal.frontiersin.org/Journal/10.3389/fnevo.2011.00002/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2011.00053/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2013.00245/full
http://journal.frontiersin.org/Journal/10.3389/fnhum.2014.00345/full
http://journal.frontiersin.org/Journal/10.3389/fnins.2011.00138/full
http://journal.frontiersin.org/Journal/10.3389/fnins.2012.00118/full
http://journal.frontiersin.org/Journal/10.3389/fnmol.2011.00052/full
http://journal.frontiersin.org/Journal/10.3389/fnmol.2014.00048/full
http://journal.frontiersin.org/Journal/10.3389/fnsys.2011.00073/full
http://journal.frontiersin.org/Journal/10.3389/fphar.2012.00115/full
http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00164/full
http://journal.frontiersin.org/Journal/10.3389/fphys.2013.00342/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00005/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00011/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2011.00110/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00001/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00022/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00159/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2012.00227/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00250/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00261/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00327/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00367/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00377/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00386/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2013.00547/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2014.00179/full
http://journal.frontiersin.org/Journal/10.3389/fpls.2014.00296/full
http://journal.frontiersin.org/Journal/10.3389/fpsyg.2014.00163/full
http://journal.frontiersin.org/Journal/10.3389/fpsyg.2014.00282/full
http://journal.frontiersin.org/Journal/10.3389/fpubh.2014.00043/full
http://journal.frontiersin.org/Journal/10.3389/neuro.02.026.2009/full

rossmounce commented 10 years ago

Just to clarify, it was nodejs that crashed. Apport caught the crash. I'd send the .crash file, but it's 260 MB!

blahah commented 10 years ago

Is the .crash a text file? If so, can you run tail -n 1000 on it and send me the output?
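If tail isn't available, roughly the same thing can be done from Node itself. The following is only a sketch, assuming a current Node.js release; the function name lastLines and its chunkSize parameter are made up for illustration and are not part of quickscrape. It reads the file backwards in chunks, so a 260 MB crash file never has to be loaded whole.

```javascript
const fs = require('fs');

// Return the last `count` lines of the file at `path`.
// Reads fixed-size chunks from the end of the file until enough
// newline bytes have been seen, so memory use stays small.
// `lastLines` and `chunkSize` are hypothetical names for this sketch.
function lastLines(path, count, chunkSize = 64 * 1024) {
  const fd = fs.openSync(path, 'r');
  try {
    let position = fs.fstatSync(fd).size;
    const chunks = [];
    let newlines = 0;
    // count + 1 newlines guarantee `count` complete lines, even when
    // the file ends with a trailing newline.
    while (position > 0 && newlines <= count) {
      const length = Math.min(chunkSize, position);
      position -= length;
      const buffer = Buffer.alloc(length);
      fs.readSync(fd, buffer, 0, length, position);
      for (const byte of buffer) {
        if (byte === 0x0a) newlines++;
      }
      chunks.unshift(buffer); // keep chunks in file order
    }
    const lines = Buffer.concat(chunks).toString('utf8').split('\n');
    if (lines[lines.length - 1] === '') lines.pop(); // trailing newline
    return count > 0 ? lines.slice(-count) : [];
  } finally {
    fs.closeSync(fd);
  }
}
```

For example, lastLines('/path/to/report.crash', 1000).join('\n') would give the text to paste into the issue; the path here is a placeholder.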

rossmounce commented 10 years ago

plaintext. Lots of garbage until the last 35 lines:

ZuEGNyUzY+elJyYjpeLCcEiZFvZDkNmFvk90sCt/XuUuy/3pv5++kyb5zJuo40c8WTv93K/r9gajvl4n1gwjRfeI0Gk8Sb4mJ6bGdo09KSm8jup+Tvr/zYpG2tzNP5E7Rl1Ki36iqRPUdMURP57/qlkPpqkmSYDRi5TUvizaQFLOfIDkScRgHEN1xxO3pHNx2Xrp3/qt0/hIt+MMZ06Zl6pNLpbMb/djUjNTRKekT6FFO+ihaIAipzTXNjiaHbYGQWiM+NUutXNLQvKSexo4mIbW6oSW1tapJOmZRZ8orhdSGRkeVkFpVU76wyVZfVV5T2aTkJAq2+toKIbWpUXyAR7Z7C1qqy0lTdWJlbXOFrYnQsNU5asWWWiochGqzUlDeVGWvs1VU1Vc1OIRUR5WT/FvRWC9l+xoGwT5ey+1T9Bo8A1H2+/z5hCDQ93YGqepxH2MF/ASNcnagVeHjfma4qk69L6rRsvsgQbWXVYcRdK/QiPi473lDq5wbqPnnY/FzXSdU+Liv8kNBhhCcf0F1bjBARU85p2HPZwRObhimyftz/ryGPacJhT+bw3cBvgvw+Y0Cz/88OAPA962U8yn2XCqGw8djskpofzTky+Ioflkcqz98+ziOdRz/mwF/cxyrJ6HG3wllyF874LcD/gUR8Fdy7e8F/L2AfzAC/v1Qhvv4LfEUf0u8cj4VDn8bh4/nTPhC2wluwHQc/nOAb+DK8RzyqhDtq+NgH7geAQ29FIH/aIiGaIiGaIiGaIiGaIiGczGY3/RapccIr74VDdEQDf9lgdj//qj9R0M0/Nfa/zhN1P6jIRqiIRqiIRr+q8L/A7o/r2IAgB5y
ApportVersion: 2.14.1-0ubuntu3
Dependencies:
 dpkg 1.17.5ubuntu5
 gcc-4.8-base 4.8.2-19ubuntu1
 gcc-4.9-base 4.9-20140406-0ubuntu1
 libacl1 2.2.52-1
 libattr1 1:2.4.47-1ubuntu1
 libbz2-1.0 1.0.6-5
 libc6 2.19-0ubuntu6
 libgcc1 1:4.9-20140406-0ubuntu1
 liblzma5 5.1.1alpha+20120614-2ubuntu2
 libpcre3 1:8.31-2ubuntu2
 libreadline6 6.3-4ubuntu2
 libselinux1 2.2.2-1
 libstdc++6 4.8.2-19ubuntu1
 libtinfo5 5.9+20140118-1ubuntu1
 multiarch-support 2.19-0ubuntu6
 readline-common 6.3-4ubuntu2
 rlwrap 0.37-5
 tar 1.27.1-1
 zlib1g 1:1.2.8.dfsg-1ubuntu1
InstallationDate: Installed on 2014-04-27 (42 days ago)
InstallationMedia: Lubuntu 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.2)
Package: nodejs 0.10.28-1chl1~trusty1 [origin: LP-PPA-chris-lea-node.js]
PackageArchitecture: amd64
ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
SourcePackage: nodejs
Tags: third-party-packages trusty
UnreportableReason:
 You have some obsolete package versions installed. Please upgrade the following packages and check if the problem still occurs:

 dpkg, libselinux1
UpgradeStatus: No upgrade log present (probably fresh install)
_MarkForUpload: True

rossmounce commented 10 years ago

head -100:

ProblemType: Crash
Architecture: amd64
CurrentDesktop: LXDE
Date: Mon Jun 9 14:47:41 2014
DistroRelease: Ubuntu 14.04
ExecutablePath: /usr/bin/nodejs
ExecutableTimestamp: 1399084766
ProcCmdline: node /usr/bin/quickscrape --urllist 90.txt --scraper journal-scrapers/generic_open.json
ProcCwd: /home/ross/workspace/quickscrape/output/http_journal.frontiersin.org_Journal_10.3389_fphys.2013.00164_full
ProcEnviron:
 TERM=xterm
 SHELL=/bin/bash
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 LANGUAGE=en_GB:en
 XDG_RUNTIME_DIR=
ProcMaps:
 00400000-00bf5000 r-xp 00000000 08:08 27787305 /usr/bin/nodejs
 00df4000-00df5000 r-xp 007f4000 08:08 27787305 /usr/bin/nodejs
 00df5000-00e0a000 rwxp 007f5000 08:08 27787305 /usr/bin/nodejs
 00e0a000-00e12000 rwxp 00000000 00:00 0
 02b0e000-064e2000 rwxp 00000000 00:00 0 [heap]
 15ef000000-15ef100000 rwxp 00000000 00:00 0
 2abda00000-2abdb00000 rwxp 00000000 00:00 0
 33c8e00000-33c8f00000 rwxp 00000000 00:00 0
 34de200000-34de300000 rwxp 00000000 00:00 0
 3a93800000-3a93900000 rwxp 00000000 00:00 0
 6f38000000-6f38100000 rwxp 00000000 00:00 0
 73c2900000-73c2a00000 rwxp 00000000 00:00 0
 7e4f400000-7e4f500000 rwxp 00000000 00:00 0
 91d4600000-91d4700000 rwxp 00000000 00:00 0
 9282d00000-9282e00000 rwxp 00000000 00:00 0
 973d900000-973da00000 rwxp 00000000 00:00 0
 a1f2e00000-a1f2f00000 rwxp 00000000 00:00 0
 a87c500000-a87c600000 rwxp 00000000 00:00 0
 b7b2900000-b7b2a00000 rwxp 00000000 00:00 0
 bacba00000-bacbb00000 rwxp 00000000 00:00 0
 c4be300000-c4be400000 rwxp 00000000 00:00 0
 cb1a700000-cb1a800000 rwxp 00000000 00:00 0
 cd03d00000-cd03e00000 rwxp 00000000 00:00 0
 da24900000-da24a00000 rwxp 00000000 00:00 0
 ddc8c00000-ddc8d00000 rwxp 00000000 00:00 0
 e332200000-e332300000 rwxp 00000000 00:00 0
 10062b00000-10062c00000 rwxp 00000000 00:00 0
 10c48100000-10c48200000 rwxp 00000000 00:00 0
 116b3c00000-116b3d00000 rwxp 00000000 00:00 0
 11a64e00000-11a64f00000 rwxp 00000000 00:00 0
 11d2f800000-11d2f900000 rwxp 00000000 00:00 0
 13360c00000-13360d00000 rwxp 00000000 00:00 0
 13560200000-13560300000 rwxp 00000000 00:00 0
 144fc300000-144fc400000 rwxp 00000000 00:00 0
 1518f700000-1518f800000 rwxp 00000000 00:00 0
 154ef600000-154ef700000 rwxp 00000000 00:00 0
 15d35a00000-15d35b00000 rwxp 00000000 00:00 0
 15f09a00000-15f09b00000 rwxp 00000000 00:00 0
 16ce1f00000-16ce2000000 rwxp 00000000 00:00 0
 17dfad00000-17dfae00000 rwxp 00000000 00:00 0
 1982f100000-1982f200000 rwxp 00000000 00:00 0
 19f96b00000-19f96c00000 rwxp 00000000 00:00 0
 1a3d7f00000-1a3d8000000 rwxp 00000000 00:00 0
 1a63f100000-1a63f200000 rwxp 00000000 00:00 0
 1bcc0f00000-1bcc1000000 rwxp 00000000 00:00 0
 1ee63600000-1ee63700000 rwxp 00000000 00:00 0
 1ee6c400000-1ee6c500000 rwxp 00000000 00:00 0
 1ff8ca00000-1ff8cb00000 rwxp 00000000 00:00 0
 208cb800000-208cb900000 rwxp 00000000 00:00 0
 20c16c00000-20c16d00000 rwxp 00000000 00:00 0
 211ea700000-211ea800000 rwxp 00000000 00:00 0
 2202d000000-2202d100000 rwxp 00000000 00:00 0
 22722100000-22722200000 rwxp 00000000 00:00 0
 229df500000-229df600000 rwxp 00000000 00:00 0
 22d5c300000-22d5c400000 rwxp 00000000 00:00 0
 236e9f00000-236ea000000 rwxp 00000000 00:00 0
 24c19900000-24c19a00000 rwxp 00000000 00:00 0
 24d15000000-24d15100000 rwxp 00000000 00:00 0
 258aee00000-258aef00000 rwxp 00000000 00:00 0
 25fda800000-25fda900000 rwxp 00000000 00:00 0
 2674e000000-2674e100000 rwxp 00000000 00:00 0
 29570d00000-29570e00000 rwxp 00000000 00:00 0
 29f2bc00000-29f2bd00000 rwxp 00000000 00:00 0
 2acb2500000-2acb2600000 rwxp 00000000 00:00 0
 2c4ba200000-2c4ba300000 rwxp 00000000 00:00 0
 2cd66700000-2cd66800000 rwxp 00000000 00:00 0
 2d9ace00000-2d9acf00000 rwxp 00000000 00:00 0
 2ecbb500000-2ecbb600000 rwxp 00000000 00:00 0
 2f9a2d00000-2f9a2e00000 rwxp 00000000 00:00 0
 2ff8e800000-2ff8e900000 rwxp 00000000 00:00 0
 314c6500000-314c6600000 rwxp 00000000 00:00 0
 3234c100000-3234c200000 rwxp 00000000 00:00 0
 33875400000-33875500000 rwxp 00000000 00:00 0
 33897200000-33897300000 rwxp 00000000 00:00 0
 341ecb00000-341ecc00000 rwxp 00000000 00:00 0
 35364500000-35364600000 rwxp 00000000 00:00 0
 3545f800000-3545f900000 rwxp 00000000 00:00 0
 3612fa00000-3612fb00000 rwxp 00000000 00:00 0
 3690e600000-3690e700000 rwxp 00000000 00:00 0
 373f3200000-373f3300000 rwxp 00000000 00:00 0
 38261f00000-38262000000 rwxp 00000000 00:00 0
 387cb700000-387cb800000 rwxp 00000000 00:00 0
 390a3300000-390a3400000 rwxp 00000000 00:00 0

blahah commented 10 years ago

Hmm, nothing in there sheds any light. I'll try running it myself.

blahah commented 10 years ago

I suspect this is related to a known Node bug (https://github.com/joyent/node/pull/3076). The best we can do is require Node >= v0.8.14, so I'll bump the requirement.
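One way such a minimum could be enforced at startup is sketched below. This is an illustration only, not how quickscrape actually does it: the names parseVersion, isAtLeast and MINIMUM_NODE are hypothetical, and a real project might instead declare the constraint in the engines field of package.json. The code sticks to ES5 syntax so that it still parses on the old runtimes it is meant to reject.

```javascript
// Parse "v0.10.28" or "0.8.14" into [major, minor, patch] numbers.
// (Hypothetical helper for this sketch.)
function parseVersion(version) {
  var match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) {
    throw new Error('Unrecognised version string: ' + version);
  }
  return match.slice(1, 4).map(Number);
}

// True when `version` is the same as or newer than `minimum`.
function isAtLeast(version, minimum) {
  var a = parseVersion(version);
  var b = parseVersion(minimum);
  for (var i = 0; i < 3; i++) {
    if (a[i] !== b[i]) {
      return a[i] > b[i];
    }
  }
  return true; // all three components are equal
}

// Minimum taken from the comment above; the constant name is assumed.
var MINIMUM_NODE = '0.8.14';

// process.version is the running Node version, e.g. "v0.10.28".
if (!isAtLeast(process.version, MINIMUM_NODE)) {
  console.error('Node >= v' + MINIMUM_NODE + ' is required, found ' + process.version);
  process.exit(1);
}
```

As a usage example, isAtLeast('v0.10.28', '0.8.14') returns true; 0.10.28 is the nodejs package version listed in the crash report earlier in this thread.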