Closed duzx16 closed 1 year ago
Hi @duzx16! We are considering releasing more code. Currently, many of the scripts are formatted for running on AI2 clusters, so they may not be useful in raw form for public use. Do you have a sense of which pieces would be most helpful to release (e.g., alignment code, NSFW classifier, low-cost "this image may contain a face" detector, or ...)?
@jmhessel I think the deduplication and the NSFW classifier would be most useful, since there are many details to reproduce. The NSFW classifier is also helpful in other scenarios. I am not sure whether it was open-sourced with the LAION-5B paper.
Requirement for de-duplication code +1
+1
Sorry for the very delayed reply. The NSFW detector is the same as released in this library:
The deduplication, as mentioned in the writeup, is:
Hi @jmhessel, I am also interested in the data processing code and was wondering whether you have released it. (I am also curious about your original C4 data processing code.) Thanks!
Hi @miguelusque! Which preprocessing code were you specifically looking for? The NSFW detector and deduplication are linked above. I am planning to release the fast face detection code if you're waiting on that.
Hi @jmhessel , thank you for your swift reply.
I am actually interested in the full pipeline. I assume there is language detection with a specific threshold, removal of documents with fewer than a certain number of sentences, and maybe removal of sentences with fewer than a certain number of words. I also suspect that your list-based sentence removal includes some heuristic (e.g., remove if more than x words from that list occur in a sentence or document).
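For illustration only, the kind of filters guessed at above could be sketched as follows. This is not the released pipeline: every threshold, the blocklist contents, the naive sentence splitter, and the `lang_score` callable are hypothetical placeholders, and the actual heuristics may differ.

```python
import re

# Hypothetical placeholder values; the real pipeline's settings are not stated in this thread.
MIN_WORDS_PER_SENTENCE = 3
MIN_SENTENCES_PER_DOC = 2
MAX_BLOCKLIST_HITS = 0
BLOCKLIST = {"lorem", "ipsum"}


def split_sentences(text):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def blocklist_hits(sentence, blocklist):
    # Count lowercase word tokens that appear in the blocklist.
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for w in words if w in blocklist)


def clean_document(
    text,
    lang_score=None,          # hypothetical callable: text -> confidence in [0, 1]
    min_lang_score=0.5,       # hypothetical threshold
    min_words=MIN_WORDS_PER_SENTENCE,
    min_sentences=MIN_SENTENCES_PER_DOC,
    blocklist=BLOCKLIST,
    max_hits=MAX_BLOCKLIST_HITS,
):
    """Return the filtered text, or None if the document should be dropped."""
    # Document-level language filter (only applied if a scorer is supplied).
    if lang_score is not None and lang_score(text) < min_lang_score:
        return None

    kept = []
    for sentence in split_sentences(text):
        if len(sentence.split()) < min_words:
            continue  # sentence too short
        if blocklist_hits(sentence, blocklist) > max_hits:
            continue  # too many blocklisted words
        kept.append(sentence)

    # Drop documents that end up with too few sentences.
    if len(kept) < min_sentences:
        return None
    return " ".join(kept)


doc = (
    "Short one. This sentence has enough words. "
    "Here is another perfectly fine sentence. "
    "Lorem ipsum filler text goes here."
)
cleaned = clean_document(doc)
```

Here the sentence-level filters run first and the document-level sentence count is checked afterwards; whether the real pipeline orders its steps this way is an open question in this thread.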
I am interested because I have worked on this kind of data preprocessing pipeline myself, and I am always keen to learn from colleagues in the field.
By the way, congrats on the great work you are all doing.
Thank you for your great work. I am wondering whether you plan to release the data processing code. With the code released, people can process other datasets based on web pages beyond C4.