fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

Multiple changes running 4mz in Spark 2.2 #27

Closed snoe925 closed 6 years ago

snoe925 commented 6 years ago

Modern Hadoop does not require core-site.xml configurations for codecs.

This allows the codec to work in Spark by adding the jar to the classpath. You can copy the jar to the spark jars directory.

Implementations that lack the Java service-loading (ServiceLoader) code will ignore this META-INF data and work the same as they would without it.
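For readers unfamiliar with the mechanism, here is a minimal sketch of how Java's ServiceLoader finds providers registered under META-INF/services. It is an illustration only, not code from this pull request: the object name `CodecDiscoverySketch` and the helper `providerNames` are hypothetical, and a JDK service type stands in so the snippet runs without Hadoop on the classpath.

```scala
import java.util.ServiceLoader
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of 4mc or Hadoop.
object CodecDiscoverySketch {

  /** Class names of every provider that ServiceLoader can find for `service`
    * on the current classpath (via META-INF/services entries or module declarations). */
  def providerNames[T](service: Class[T]): List[String] = {
    val names = ListBuffer.empty[String]
    val it = ServiceLoader.load(service).iterator()
    // Providers are instantiated lazily as the iterator advances.
    while (it.hasNext) names += it.next().getClass.getName
    names.toList
  }

  def main(args: Array[String]): Unit = {
    // Stand-in service type from the JDK so this runs without extra jars.
    // Assumption: with hadoop-common and the 4mc jar on the classpath, one would
    // pass classOf[org.apache.hadoop.io.compress.CompressionCodec] instead.
    providerNames(classOf[java.nio.file.spi.FileSystemProvider]).foreach(println)
  }
}
```

If the 4mc codecs are registered as described above, calling the helper with the Hadoop codec interface from a Spark shell should list them; an empty result would suggest the jar or its META-INF entry is not on the classpath.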

snoe925 commented 6 years ago

I found that these changes were required to get 4mz working with newAPIHadoopFile. Here is an example spark shell reader.

sc.newAPIHadoopFile("data.4mz", classOf[com.hadoop.mapreduce.FourMzTextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text])

jordiolivares commented 6 years ago

Why hasn't this been merged yet?

Specifically, commit f6a57e3 has a really basic fix that is necessary for ZSTD to function properly. I would also add that FourMcTextInputFormat needs the LongWritable and Text generic type parameters, like FourMzTextInputFormat in your version.

snoe925 commented 6 years ago

I can volunteer as a maintainer. I can also make an official repo if you want to avoid notifications.

carlomedas commented 6 years ago

I'd like to merge the first part of the pull request. The index changes to the 4mc CLI, however, are not clear to me. What are they doing? The index in 4mc/4mz files is already inside the file itself.

carlomedas commented 6 years ago

P.S.: I can use your help to rebuild the lib on all platforms.

snoe925 commented 6 years ago

I should have pushed the external index code to a separate branch. I was experimenting with timestamp indexing of the data in a 4mz file. Let me fix the pull request.

snoe925 commented 6 years ago

I have removed the incorrect index code commit from this pull request.

snoe925 commented 6 years ago

For platform building I will open a separate pull request for a Travis CI integration file. That can build Linux and OS X. I do not have Windows build machines.

carlomedas commented 6 years ago

Yes, that'd be perfect, even if Linux is not an issue. I'm going to rebuild a new version of the lib soon, and Mac is easy too. The only issue I have now is with Windows, where you need cygwin64 to build it correctly so that it works well with the latest JRE 7/8 on the latest Windows versions. Since I don't think there are many people using it, we could even think about releasing without it unless we find the time to recreate the build system I unfortunately lost in the past year...