Any plan to release the data processing code?

allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.

MIT License

Any plan to release the data processing code? #4

Closed by duzx16 1 year ago

duzx16 commented 1 year ago

Thank you for your great work. I am wondering whether you plan to release the data processing code. With the code released, people can process other datasets based on web pages beyond C4.

jmhessel commented 1 year ago

Hi @duzx16! We are considering releasing more code. Currently, many of the scripts are formatted for running on AI2 clusters, so they may not be useful in raw form for public use. Do you have a sense of which pieces would be most helpful to release (e.g., the alignment code, the NSFW classifier, the low-cost "this image may contain a face" detector, or something else)?

duzx16 commented 1 year ago

@jmhessel I think the deduplication code and the NSFW classifier would be most useful, since there are many details to reproduce. The NSFW classifier would also be helpful in other scenarios. I am not sure whether it was open-sourced with the LAION-5B paper.

yuezewang commented 1 year ago

+1 to the request for the de-duplication code.

jmhessel commented 1 year ago

Sorry for the very delayed reply. The NSFW detector is the same as released in this library:

https://github.com/mlfoundations/dataset2metadata

The deduplication, as mentioned in the writeup, is:

https://gitlab.com/opennota/findimagedupes
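For readers unfamiliar with near-duplicate image detection, here is a minimal sketch of the general idea. It is only an illustration: it uses a simple average hash as a stand-in and does not reproduce the algorithm, options, or thresholds of findimagedupes or of the mmc4 pipeline. The names `average_hash`, `hamming`, `dedupe`, and the `max_distance` value are all hypothetical, and images are assumed to be already decoded into rows of grayscale values at least 8x8 in size.

```python
from typing import Dict, List, Sequence

Gray = Sequence[Sequence[int]]  # grayscale image: rows of 0-255 values


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Block-average the image down to size x size, then threshold at the mean."""
    h, w = len(pixels), len(pixels[0])
    cells: List[float] = []
    for i in range(size):
        r0 = i * h // size
        r1 = max((i + 1) * h // size, r0 + 1)
        for j in range(size):
            c0 = j * w // size
            c1 = max((j + 1) * w // size, c0 + 1)
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if brighter than the image-wide mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(images: Dict[str, Gray], max_distance: int = 4) -> List[str]:
    """Greedy O(n^2) pass: keep an image only if no kept image is within max_distance."""
    kept = []  # list of (name, hash) pairs
    for name, pixels in images.items():
        h = average_hash(pixels)
        if all(hamming(h, other) > max_distance for _, other in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

The quadratic comparison is fine for a toy example; at web scale one would need an index over the hashes, which this sketch does not attempt.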

miguelusque commented 1 year ago

Hi @jmhessel, I am also interested in the data processing code and was wondering whether you have released it. (I am also curious about your original C4 data processing code.) Thanks!

jmhessel commented 1 year ago

Hi @miguelusque! Which preprocessing code were you specifically looking for? The NSFW detector and the deduplication tool are linked above. I am planning to release the fast face detection code, if that is what you're waiting on.

miguelusque commented 1 year ago

Hi @jmhessel , thank you for your swift reply.

I am actually interested in the full pipeline. I assume there is language detection with a specific threshold, as well as removal of documents with fewer than a certain number of sentences, and maybe removal of sentences with fewer than a certain number of words. I also suspect that your list-based sentence removal includes some heuristic (e.g., remove if more than x words from that list appear in a sentence or document).
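To make the question concrete, here is a minimal sketch of the kind of filters I have in mind. It is a guess at the general shape only, not the mmc4 or C4 code: every threshold is a hypothetical placeholder, the sentence splitter is deliberately naive, and `lang_score` stands for whatever language-identification model is used (it is passed in as a plain callable returning a confidence for the target language).

```python
import re
from typing import Callable, List, Optional, Set

# All thresholds are hypothetical placeholders, not values from mmc4 or C4.
MIN_LANG_CONFIDENCE = 0.9
MIN_SENTENCES = 3
MIN_WORDS_PER_SENTENCE = 3
MAX_BLOCKLIST_HITS = 0  # drop a sentence with more than this many listed words


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_document(
    text: str,
    blocklist: Set[str],
    lang_score: Callable[[str], float],
) -> Optional[str]:
    """Return the filtered text, or None if the whole document is rejected."""
    # 1. Language detection with a confidence threshold.
    if lang_score(text) < MIN_LANG_CONFIDENCE:
        return None

    kept = []
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence.lower())
        # 2. Drop sentences with too few words.
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue
        # 3. Drop sentences with too many words from the list.
        if sum(word in blocklist for word in words) > MAX_BLOCKLIST_HITS:
            continue
        kept.append(sentence)

    # 4. Drop documents left with too few sentences.
    if len(kept) < MIN_SENTENCES:
        return None
    return " ".join(kept)
```

For example, `filter_document(doc, {"badword"}, lambda t: 1.0)` would run the sentence-level filters with a dummy language scorer that accepts everything. Whether the real pipeline applies the list at the sentence or document level, and in what order the steps run, is exactly what I would like to learn.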

I ask because I have been involved in this kind of data preprocessing pipeline myself, and I am always keen to learn from other colleagues in the field.

Btw, congrats on the great work you are all doing.