IKANOW / Aleph2

The IKANOW v2 meta-database and analytics platform
Apache License 2.0

Hadoop/spark input missing segments #94

Open Alex-Ikanow opened 8 years ago

Alex-Ikanow commented 8 years ago

From Caleb:

2016-05-16 10:10:16 [ForkJoinPool.commonPool-worker-14] INFO  BeJobLauncher:192 - Adding data service path for bucket /backup/ttm/master: service_name=Aleph2EsInputFormat options={es.resource=ttm_master__2b5fd691353a/type_1,stats, es.index.read.missing.as.empty=yes, es.query=?q=*}

for a bucket that is comprised of two indexes (ttm_master__2b5fd691353a and ttm_master__2b5fd691353a_1)

I think the problem is that when tmin/tmax are specified, it tries to filter on date. My guess is that it's picking up the segment id as a date and then ignoring that index.

    final String final_index = getTimedIndexes(job_input, index_type_mapping, new Date())
            .map(s -> Stream.concat(s, TimeSliceDirUtils.getUntimedDirectories(index_type_mapping.keySet().stream()))
                    .collect(Collectors.joining(",")))
            .orElse(index_resource);

Ah, it looks like candidateTimedDirectories in TimeSliceDirUtils (in aleph2_core_shared_library) doesn't expect the segment id (which is purely an ES construct)

Alex-Ikanow commented 8 years ago

The problem is that the ES-hadoop/spark logic doesn't try to figure out whether indexes are timesliced except by their suffix. Of course, in practice we're guaranteed that the date suffix after _ starts with 4 digits (year), and if there are >999 segments then we have other problems.

So it should be safe to add a number-of-digits-at-the-end check to the TimeSliceDirUtils code
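
A minimal sketch of what such a digit-count check could look like, assuming a segment id is always a trailing `_` followed by 1 to 3 digits and a date suffix always starts with a 4-digit year. The class and method names (`SegmentSuffixCheck`, `hasSegmentSuffix`, `stripSegmentSuffix`) are hypothetical and are not part of TimeSliceDirUtils; the date-suffixed index name in the example is also made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the actual Aleph2 code: tells an ES segment suffix
// ("_1", "_42", ...) apart from a date suffix, which starts with a 4-digit year.
final class SegmentSuffixCheck {

    // Assumption: a segment id is a final "_" followed by at most 3 digits
    // (more than 999 segments is treated as out of scope, per the discussion above).
    private static final Pattern SEGMENT_SUFFIX = Pattern.compile("^(.+)_([0-9]{1,3})$");

    // True if the index name ends in something that looks like a segment id.
    static boolean hasSegmentSuffix(final String index) {
        return SEGMENT_SUFFIX.matcher(index).matches();
    }

    // Returns the index name without its segment suffix, or unchanged if it has none.
    static String stripSegmentSuffix(final String index) {
        final Matcher m = SEGMENT_SUFFIX.matcher(index);
        return m.matches() ? m.group(1) : index;
    }

    public static void main(final String[] args) {
        // Segmented index from the log above: the suffix is recognised and stripped
        System.out.println(stripSegmentSuffix("ttm_master__2b5fd691353a_1"));
        // Unsegmented index: returned unchanged
        System.out.println(stripSegmentSuffix("ttm_master__2b5fd691353a"));
        // Made-up date-suffixed name: a 4-digit year is not mistaken for a segment id
        System.out.println(hasSegmentSuffix("ttm_master__2b5fd691353a_2016"));
    }
}
```

The idea would be to run a check like this before the date parsing, so that a short numeric suffix is never interpreted as a time slice and the segmented index is kept in the untimed set.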