SerCeMan / fontogen

Hey, Computer, Make Me a Font
https://serce.me/posts/02-10-2023-hey-computer-make-me-a-font
MIT License

Publish dataset? #1

Open juh9870 opened 1 year ago

juh9870 commented 1 year ago

Hello, is there info about the dataset used for training the model? It's pretty important for some use cases (like publishing games on steam) to be sure that all training data is licensed appropriately, and I can't find any mentions of what kind of fonts were used during training. Were they all free?

davelab6 commented 1 year ago

If they are OFL, do you believe the output must be OFL too?

juh9870 commented 1 year ago

The issue is that without the dataset being public, we can't be sure that all inputs were OFL, and so reasoning can't continue past that

SerCeMan commented 1 year ago

👋 The font dataset was mainly a combination of fonts from https://www.dafont.com/ and https://allfreefonts.co/. I didn't release the dataset because releasing a scraped dataset isn't a very ethical thing to do.

If there is an alternative, fully OSS dataset, it's possible to re-train the model pretty trivially.

eliheuer commented 1 year ago

Thanks! You could use the Google Fonts repo: https://github.com/google/fonts

If you look in the ofl directory, there are around 1.5k font families (3.5k fonts), all licensed under the same OSS license, the OFL.

Almost all of these fonts have sources available in public git repos. Training on and outputting UFO font sources is something I'm interested in, and it might produce better results when trying to use ML for real-world font production.
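
For anyone who wants to try this route, here is a minimal sketch, assuming a local clone of the google/fonts repo (the clone path and function name are placeholders, not part of fontogen), of how the OFL-licensed binaries could be enumerated for a training set:

```python
# Minimal sketch: enumerate OFL-licensed font binaries in a local clone of
# https://github.com/google/fonts. The clone path is an assumed placeholder.
from pathlib import Path

GOOGLE_FONTS_CLONE = Path("google-fonts")  # assumed location of the clone

def collect_ofl_fonts(repo_root: Path):
    """Yield (family_name, font_path) pairs for every TTF under ofl/."""
    ofl_dir = repo_root / "ofl"
    for family_dir in sorted(p for p in ofl_dir.iterdir() if p.is_dir()):
        for font_path in sorted(family_dir.glob("*.ttf")):
            yield family_dir.name, font_path

if __name__ == "__main__":
    fonts = list(collect_ofl_fonts(GOOGLE_FONTS_CLONE))
    families = {name for name, _ in fonts}
    print(f"Found {len(fonts)} OFL font binaries across {len(families)} families")
```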

alerque commented 1 year ago

In other words, the training data includes lots of free-as-in-beer fonts, but many of them are definitely not licensed for this kind of use, and anything derived from the model will be of questionable legal status.

I suggest re-training on a dataset with only whitelisted licenses (OFL/Apache/Public Domain/etc.), such as Google Fonts / FontSource / etc. Not only will the licensing of derivations be clearer, the quality of the fonts is likely to be higher. Many of the one-off free-font projects that post personal-use releases but no sources tend to be lower quality than OFL projects that also post sources and hence have a chance of being critiqued & improved over time.

juh9870 commented 1 year ago

Thanks for the clarification. It seems like retraining using only open fonts is required.

SerCeMan commented 1 year ago

Thank you for the suggestions. I think the biggest struggle here will be finding a large-enough dataset, 50k to 100k fonts, at least for this particular model architecture. And, I'm happy to be wrong, but reading through https://blog.eleuther.ai/transformer-math/, and working through the dataset, it seemed that even the 70,000-font dataset that I used wasn't large enough to capture the meaning of the prompts effectively beyond simple categories like sans, serif, bold, sci-fi, etc.

SerCeMan commented 1 year ago

A couple of additional things to consider:

SerCeMan commented 1 year ago

training on and outputting UFO font sources is something I'm interested in and might produce better results when trying to use ML for real world font production.

Thank you for mentioning UFO, TIL.

tphinney commented 1 year ago

Having a dataset that is almost entirely of low or highly questionable quality would seem likely to mean that the results would inevitably be of really low quality as well. Besides the legal issues, that would be a good reason to use open-source sources and the Google Fonts library instead.

Not sure how the current US Copyright Office position interacts with this at a legal-meets-practical level. I mean, without copyright on the result, can the fonts created by the AI be licensed under OFL? I am not a lawyer, but I at least suspect the answer is “no.” Would that in turn mean that any such usage by an AI would violate the OFL terms, because the Copyright Office position forces it to do so, even when the licensors would presumably be fine with it? I don’t know, but this sure seems like an unfortunate problem.

SerCeMan commented 1 year ago

This is a very good point, however, I'd also argue that:

  1. From a quality perspective, I wouldn't say that "free as in free beer" reflects the quality of the dataset, especially considering things like glyph downsampling and the necessary filtering. It could also be the case that the best results might be achieved by first training a model on a huge variable quality dataset, and then fine-tuning the model on high-quality data, similarly to how modern LLMs are trained.
  2. From the copyright perspective, I'd say it depends on your goal. I wholeheartedly agree that in order to train a model that can be used in commercial settings, e.g., something that https://fonts.google.com/ or any similar platform could offer to their users, it's necessary to train the model on OFL fonts. However, if the goal is to explore the space and find the best-performing model architecture, then in the absence of a large corpus of OFL fonts, "free as in free beer" is a perfectly valid research alternative. The results of FontoGen are decent, but there is definitely huuuge room for improvement, and a lot of further research is required. I wouldn't want any research to be stalled due to not having a large fully open source dataset.

tphinney commented 1 year ago

I disagree on your (1), but agree on your (2).

My problem is, I can’t tell whether any “this sucks” problem is due to the inputs, limitations engendered by limited computing resources, or limitations of the AI itself. So as a start, I would favor fixing the one of these three things that is most easily fixed. :)

Don’t get me wrong, I am super impressed that you have an AI process that is generating vector fonts, and I rather expect this to improve quickly and dramatically, as well!

tomByrer commented 1 year ago

Not sure how the current US Copyright Office position interacts with this at a legal-meets-practical level. I mean, without copyright on the result, can the fonts created by the AI be licensed under OFL?

I'll ask my IP lawyer friend next time I see him.

simoncozens commented 1 year ago

I disagree on your (1), but agree on your (2).

(2) is even more problematic.

As Timnit Gebru and others have pointed out repeatedly, the whole history of modern AI is people knocking out models "to explore the space" with whatever data they can get their hands on, having those models get fielded, and then dealing with the ethical and legal questions afterwards (if at all).

I'm flabbergasted that someone building an AI model today would still do things in that order.

SerCeMan commented 1 year ago

As Timnit Gebru and others have pointed out repeatedly, the whole history of modern AI is people knocking out models "to explore the space" with whatever data they can get their hands on, having those models get fielded, and then dealing with the ethical and legal questions afterwards (if at all). I'm flabbergasted that someone building an AI model today would still do things in that order.

I believe this argument conflates research projects with building products, and if anything, it's what can lead to gatekeeping of the entire space by large organisations.

I suggest we avoid discussing this specific conversation branch any further as it's highly opinionated, has been discussed many times in other mediums, and the discussions can easily become heated without leading to a productive outcome.

simoncozens commented 1 year ago

I believe this argument conflates research projects with building products

Research projects are conflated with building products. That's how this space works. When people start developing a model, they start with a research project. They don't then stop and completely throw it away and build another one to productionize it. They just productionize it. That's precisely why you have to get this stuff right during the research phase, if not before.

Your argument is that getting the legal and ethics side right should not "slow down" research. That's completely the wrong way around. If you don't have the right to research, don't do it, and sort that right out first.

I suggest we avoid discussing this specific conversation branch any further as it's highly opinionated, has been discussed many times in other mediums, and the discussions can easily become heated without leading to a productive outcome.

Sorry, but that's not a useful suggestion. Refusing to discuss a problem is just another way of ignoring it.

simoncozens commented 1 year ago

I should add: it probably sounds like I'm being completely negative and I don't want this project to happen. I really do. I've done a number of experiments trying to create vector fonts using DNNs myself, and I never really got anywhere, and I really like the approach of this one. It has a lot of potential. That's why it would be good to get it right.

justinpenner commented 1 year ago

👋 The font dataset was mainly a combination of fonts from … https://allfreefonts.co/

This looks like a pirate site to me. They're offering free downloads of what appear to be retail fonts from marketplaces like Creative Market and Creative Fabrica, telling users they're "free for personal use" (which they don't appear to be) and then linking to the font's product page with referral codes so they get a commission from the marketplace. Quite a bold scam if you ask me.

Also on a related note, Creative Market has their own AI font generation project in development (https://fonts.ai), so they might not be happy that you're using fonts from their library in your dataset.

arrowtype commented 1 year ago

I suggest we avoid discussing this specific conversation branch any further as it's highly opinionated, has been discussed many times in other mediums, and the discussions can easily become heated without leading to a productive outcome.

If the definition of “a productive outcome” is IP theft and laundering at scale, then no, considering ethics and legalities up front probably isn’t helpful. I would suggest, however, that establishing an ethical & legal framework for such projects – at their outset – would be a very productive outcome.

ChristineBateup commented 1 year ago

This is Christine Bateup, Director of Business & Licensing and Counsel for Frere-Jones Type, writing on behalf of Frere-Jones Type. Some basic research on your part would have confirmed that https://www.allfreefonts.co/ is indeed a pirate site. Similar to other type foundries, we try to keep our fonts off these sites, but it's often a game of whack-a-mole, and many pirate sites have opaque ownership and contact information and don't respond to DMCA notices.

I say this as background as some of the fonts to which Frere-Jones Type owns exclusive copyrights are available for unauthorized illegal download on https://www.allfreefonts.co/. These font files all contain our copyright information and license string, which directs users to https://frerejones.com/ for licensing information. Our fonts are only available for licensing for a fee from https://frerejones.com/, and we do not permit use of our fonts for any machine learning or artificial intelligence purposes. You have willfully infringed our copyrights to the extent that you have used any of our fonts as part of your data set. If you have used any of our fonts in your data set, any output from your model created by you or other users also infringes our copyrights and constitutes infringing, unauthorized derivative works.

Please confirm within 24 hours whether your data set for training the model includes any of our proprietary fonts, or any fonts owned by Tobias Frere-Jones. If we do not hear from you, we will assume that your data set does contain our fonts, and reserve our rights to take additional action.

I also understand you work for Canva; can you confirm whether any of this work is being done on your employer's behalf?

A further word of advice. In light of how https://www.allfreefonts.co/ operates, there are serious legal problems with your current data set beyond our individual case. It's incumbent on you to grapple fully with these issues, even if you just consider this a research project, rather than burying your head in the sand about potentially serious ethical and legal consequences.

You can contact me at christine@frerejones.com to discuss further.

davelab6 commented 1 year ago

The issue is that without the dataset being public, we can't be sure that all inputs were OFL, and so reasoning can't continue past that

My personal understanding is that if even one input was OFL, the result must be OFL, if the result is subject to copyright at all. I expect, though this is entirely my personal opinion, that this will vary by jurisdiction.

alerque commented 1 year ago

@ChristineBateup Every legal point you make is probably correct, and pointing out the seriousness of the issue and the known unauthorized distribution from the pirate sites this project has mentioned is probably well placed. On the other hand, adding "respond with your innocence in 24 hours or we assume you are guilty and commence suing you" is a jerky move. That does not endear your client to the type world that's watching here.

arrowtype commented 1 year ago

@alerque as a member of the font world, I would counter with the view that mass intellectual property theft would be a "jerky move".

justinpenner commented 1 year ago

Well, that escalated quickly.

Indeed, I even found one of my font families, Armoire, on this pirate site, so it looks like my work may have been part of this project's dataset. If someone downloads my fonts illegally to use for a hobby research project like this, where my work is only one point in a larger dataset, I really couldn't care less, and it's not an ethical boundary for me as long as it's kept private.

Publishing the results and the training data does cross a line for me, though. I think this project should have been rebuilt with a more ethical dataset before being made public.

alerque commented 1 year ago

@arrowtype Of course it is, which is why so many of us had already started chiming in trying to point out the problem. But that doesn't justify "respond in 24 hours or else your silence is an admission of guilt" aggressive posturing (which probably isn't even legally binding).

SerCeMan commented 1 year ago

Hey, all! In light of all of the discussion, I did the only right thing here and took down the model weights from Hugging Face. Thank you to everyone who pointed out the issues with the dataset.

However, if anyone would like to continue the research, and has a large dataset of OFL fonts, I'm very happy to help to re-train the model on it.

simoncozens commented 1 year ago

Thank you. Can I suggest the google/fonts repo, which is a collection of just over 1000 OFL fonts? That's probably not enough to do meaningful Transformers work, unfortunately, but many of them are variable fonts, which means you can do a form of data augmentation by accessing the VFs at off-instance points in their designspace.
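
To make the augmentation idea concrete, here is a rough sketch using fontTools (the function name, sampling strategy, and sample count are illustrative assumptions, not part of fontogen), pinning a variable font at random designspace locations to produce additional static training instances:

```python
# Rough sketch: derive static instances from a variable font at random
# off-instance designspace locations, as a form of data augmentation.
import random
from fontTools.ttLib import TTFont
from fontTools.varLib import instancer

def sample_static_instances(vf_path: str, n_samples: int = 5):
    """Return n_samples static TTFont objects pinned at random axis locations."""
    varfont = TTFont(vf_path)
    axes = varfont["fvar"].axes
    instances = []
    for _ in range(n_samples):
        # Pick a random coordinate on every axis, e.g. wght=437.2, wdth=92.5.
        location = {a.axisTag: random.uniform(a.minValue, a.maxValue) for a in axes}
        instances.append(instancer.instantiateVariableFont(varfont, location, inplace=False))
    return instances
```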

tomByrer commented 1 year ago

@tphinney @SerCeMan I had a face-to-face conversation with my IP lawyer friend about "AI-derived work", and he replied "Still to be determined in court". Indeed, I believe the lawsuit against GitHub Copilot is still in progress.

That said, Google won their case for scanning copyrighted print books and publishing parts of the scans publicly.

eliheuer commented 1 year ago

Is anyone currently working on compiling a large dataset of high-quality OFL fonts to re-train the model on? Google Fonts has 3.5k OFL fonts and there are many more OFL fonts on GitHub and GitLab that are not in the Google Fonts repos.

I'll start this work on my own soon, but would love to collaborate if anyone else is working on this and wants to work together.