baturin / wikivoyage-listings

Data extracted from Wikivoyage, the free travel guide at http://wikivoyage.org. Leverage Wikivoyage listings on your smartphone, or in your own mashups.
http://wvpoi.batalex.ru/

Aborting during download/extract leaves a broken file #60

Closed nicolas-raoul closed 6 years ago

nicolas-raoul commented 6 years ago

Steps:

  1. Execute the script
  2. Abort (for instance with CTRL-C)
  3. Execute again
  4. Error appears:
    [2018-02-27 10:53:14] Use cached dump
    [2018-02-27 10:53:14] Parse dump
    [2018-02-27 10:53:14] Save to '../wikivoyage.github.io/wikivoyage-listings-fr.csv'
    Failure
    org.wikivoyage.listings.input.DumpReadException: Failed to get article in Wikivoyage dump: error when reading XML
        at org.wikivoyage.listings.input.DumpArticlesIterator.getNext(DumpArticlesIterator.java:189)
        at org.wikivoyage.listings.input.DumpArticlesIterator.next(DumpArticlesIterator.java:59)
        at org.wikivoyage.listings.input.DumpListingsIterator.getNext(DumpListingsIterator.java:40)
        at org.wikivoyage.listings.input.DumpListingsIterator.next(DumpListingsIterator.java:57)
        at org.wikivoyage.listings.input.DumpListingsIterator.next(DumpListingsIterator.java:17)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:41)
        at org.wikivoyage.listings.validators.SimpleValidator$SimpleValidatorIterator.next(SimpleValidator.java:27)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.validateNextBatch(WikidataValidator.java:60)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.next(WikidataValidator.java:51)
        at org.wikivoyage.listings.validators.WikidataValidator$WikidataValidatorIterator.next(WikidataValidator.java:35)
        at org.wikivoyage.listings.output.CSV.write(CSV.java:60)
        at org.wikivoyage.listings.Main.generateFileForFormat(Main.java:235)
        at org.wikivoyage.listings.Main.main(Main.java:98)
    Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[197812,34]
    Message: unexpected end of stream
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:599)
        at org.wikivoyage.listings.input.DumpArticlesIterator.getNext(DumpArticlesIterator.java:183)
        ... 18 more

Workaround: Remove all files in the dumps-cache folder.

The tool should download and extract under a temporary name, and rename the result to the final filename only after extraction has finished successfully.
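The temporary-name idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the class `SafeCacheWriter` and the method `writeAtomically` are hypothetical names, and the sketch assumes the cache check only looks for the final filename.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of wikivoyage-listings.
final class SafeCacheWriter {
    private SafeCacheWriter() {}

    /**
     * Copies source into a temporary sibling of target and renames it to
     * target only after the copy has completed. An aborted run can leave a
     * stray ".part" file behind, but never a truncated file under the
     * final name.
     */
    static void writeAtomically(InputStream source, Path target) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Files.createDirectories(dir);
        // Same directory as the target, so the rename stays on one filesystem.
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".part");
        try {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            try {
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback for filesystems without atomic rename.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // No-op after a successful move; cleans up after a failed copy.
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this shape, a "Use cached dump" check that tests only for the final filename would never pick up a partial download, because that name does not exist until the copy has finished.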

nicolas-raoul commented 6 years ago

Thanks @olgfok for the pull request!

I think there is still a case where it can fail: when the program is interrupted during decompression. That is probably what happened in the log above. The XML was broken because the downloaded file had not been completely decompressed.

olgfok commented 6 years ago

@nicolas-raoul, thanks for the comment. I think the error in the log happened because the first execution was aborted during the download. If you look at how the file is decompressed, you'll see that it is done in memory using BZip2CompressorInputStream. So if you interrupt the process during decompression, the compressed file itself remains undamaged, and the next execution will decompress the very same file that was downloaded.
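A rough sketch of why streaming decompression leaves the cached file alone. BZip2CompressorInputStream comes from Apache Commons Compress; to keep the example stdlib-only, it swaps in `java.util.zip.GZIPInputStream` as a stand-in, and the class `StreamingDecompression` and method `countLines` are hypothetical names rather than anything from the project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration, not part of wikivoyage-listings.
final class StreamingDecompression {
    private StreamingDecompression() {}

    /**
     * Decompresses the file on the fly and counts its lines. The compressed
     * file is only opened for reading, so interrupting this method cannot
     * modify it. GZIPInputStream stands in for BZip2CompressorInputStream.
     */
    static long countLines(Path compressed) throws IOException {
        try (InputStream raw = Files.newInputStream(compressed);
             InputStream plain = new GZIPInputStream(raw);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(plain, StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```

Since nothing is written back to disk, the only way the cached file ends up damaged in this setup is if it was written incompletely in the first place, which matches the explanation that the abort happened during the download.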

nicolas-raoul commented 6 years ago

OK, thanks for investigating, and thanks for the fix! :-)