dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0

WARCFileWriter throws IOException if file already exists #13

Closed habernal closed 8 years ago

habernal commented 8 years ago

The createSegment() method should create a new segment (file) and not overwrite an existing one; however, this is not the case on S3.

This code should be updated:

FSDataOutputStream fsStream = (progress == null) ?
                fs.create(path, false) :
                fs.create(path, progress);

as fs.create(path, false) sometimes throws:

Error: java.io.IOException: File already exists:s3://ukp-research-data/c4corpus/cc-phase1out-2016-07/part-r-00000.seg-00000.warc.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:634)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:912)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:893)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:790)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:182)
    at de.tudarmstadt.ukp.dkpro.c4corpus.hadoop.io.WARCFileWriter.createSegment(WARCFileWriter.java:152)
    at ...
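
For illustration only, one way to avoid the "File already exists" failure is to treat it as a signal to move on to the next segment number instead of aborting. The sketch below is not the change made in this project and does not use the Hadoop FileSystem API: it uses java.nio.file on a local directory as a stand-in, and the class SegmentSketch and its methods segmentName and createNextSegment are hypothetical names. The file name pattern is assumed from the path in the stack trace above.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, not part of dkpro-c4corpus.
final class SegmentSketch {

    // Builds a name such as "part-r-00000.seg-00000.warc.gz"
    // (pattern assumed from the stack trace in this issue).
    static String segmentName(String base, int segment) {
        return String.format(Locale.ROOT, "%s.seg-%05d.warc.gz", base, segment);
    }

    // Creates the first segment file at or after firstSegment that does not
    // exist yet. Files.createFile fails if the file is already present, so an
    // existing segment is never overwritten; we skip to the next number instead.
    static Path createNextSegment(Path dir, String base, int firstSegment)
            throws IOException {
        for (int n = firstSegment; ; n++) {
            Path candidate = dir.resolve(segmentName(base, n));
            try {
                return Files.createFile(candidate);
            } catch (FileAlreadyExistsException e) {
                // Segment n is taken (for example by an earlier task attempt);
                // try the next segment number.
            }
        }
    }
}
```

Calling createNextSegment twice with the same base name and starting number yields two distinct files rather than an exception. Whether skipping ahead is appropriate on S3, where a retried task attempt may have left the file behind, is an assumption that would need checking against the actual job behaviour.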
habernal commented 8 years ago

Hopefully fixed by 4cc9a81. Hard to reproduce on small data; reopen if it fails on CommonCrawl in the future.

habernal commented 8 years ago

This should be investigated in more detail. Two scenarios:

habernal commented 8 years ago

Fix confirmed; ran on entire CommonCrawl without a single failure.