Msgids are deleted from POT files

maufl commented 8 years ago

I read through the issues and found that it's a feature, but unfortunately not for me. When running mix gettext.extract all msgids from POT files that have a reference are deleted if they are not present in the extracted msgids.

I'm trying to extract gettext msgids from React components that we use in our Phoenix project. Phoenix uses Webpack to bundle frontend assets and I found this handy loader that will extract msgids from Js(x) files when they are loaded via webpack. All the msgids are dumped to a new POT file in priv/gettext (because I configured it so) and they all have references. Now when I extract msgid using mix, the POT file is emptied :(

I can't change the behaviour of the loader without patching jsxgettext and jsxgettext-loader. I could put the POT file in another directory, but right now the workflow for Phoenix and Reactis perfectly in sync and I would like to keep it that way.

Is there a possibility to force (Elixir) gettext to not delete msgids from a POT file?

josevalim commented 8 years ago

@maufl can you please paste some examples of the msgids extracted from jsxgettext? Maybe we can provide some directory whitelisting. This way we won't remove msgids if they belong to a certain reference.

maufl commented 8 years ago

Currently it looks like this

msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Report-Msgid-Bugs-To: \n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"POT-Creation-Date: 2016-03-30 12:35+0000\n"

#: 0:91
msgid "Column 1"
msgstr ""

#: 0:96
msgid "Column 2"
msgid_plural "Columns 2"
msgstr[0] ""
msgstr[1] ""

But I opend a pull request in the loader repo that should fix the file names, e.g. instead of 0 it will be web/static/js/react_test.js.

Directory whitelisting sounds like a good idea to me.

maufl commented 8 years ago

The change landed in the loader and now the file looks like this:

msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Report-Msgid-Bugs-To: \n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"POT-Creation-Date: 2016-04-04 07:03+0000\n"

#: web/static/js/react_test.js:94
msgid "Buy envelope"
msgid_plural "Buy envelopes"
msgstr[0] ""
msgstr[1] ""

#: web/static/js/react_test.js:95
msgid "First Column"
msgstr ""

#: web/static/js/react_test.js:100
msgid "Second Column"
msgstr ""

josevalim commented 8 years ago

Awesome! So I think we could support some sort of whitelisting where you would tell it to not touch translations from "web/static/js". Would you be interested in trying a patch? :)

maufl commented 8 years ago

I'm already on it. I thought about on which level white/black listing could happen. We could either exclude pot files/domains while extracting based on a pattern, or we could keep translations based on references. Do you know which solution you would prefer? I have a working patch for the first approach but I could write a patch for the second approach too.

josevalim commented 8 years ago

Both are fine to me. Excluding domains is easier for sure if you guarantee to keep the JS translations in separate domains. Let's wait and see @whatyouhide thinks about this.

whatyouhide commented 8 years ago

In my opinion, one of the problems here could be that when extracting new translations, we read all translations from all POT files _in whatever subdir of each backend's :priv_dir directory_ (as can be seen here):

  # Returns all the .pot files for each of the given `backends`.
  defp pot_files_for_backends(backends) do
    Enum.flat_map backends, fn backend ->
      backend.__gettext__(:priv)
      |> Path.join("**/*.pot")
      |> Path.wildcard()
    end
  end

I think one step in the right direction could be to, as a starter, change that wildcard to "*.pot": we only support POT files directly in the :priv_dir directory, not in subdirectories (no slashes in domain names, see #76 as well).

This could work towards solving this issue as well, if the JS translations are simply stuffed in, e.g., priv/gettext/frontend/default.pot (and priv/gettext/frontend/en_US/LC_MESSAGES/default.po in turn). @maufl would this be a feasible solution? We can still support black- or whitelisting, of course, but I think this step is a good step either way.

maufl commented 8 years ago

I'm not sure. I just tried this and it seems that currently, each directory under :priv_dir is expected to be a locale by the gettext.merge task. So putting frontend files into a separate folder would require checking whether the folder contains translations for a locale.

I could put the frontend translations into a completely different folder, but then I would have to run merging twice. My goal was managing both translations together, e.g. running mix gettext.merge will merge all translations, including those from the frontend. I'm not sure if this can be achieved without having both pot files in the same dir.

whatyouhide commented 8 years ago

@maufl yes, you're right. Gettext (gettext.merge in particular) still assumes every directory under :priv_dir is a locale.

So, I don't particularly like either of the proposed approaches but yet I can't think of anything better; excluding domains should definitely be easier to implement, but on the other hand the user is required to stuff all the frontend (or whatever scope) translations in a domain. Of course, you can have frontend_DOMAIN.pot but it's still somehow ugly. On the other hand, this has the benefit of not making Elixir's Gettext have to deal with translations coming from different sources in the same POT file.

Maybe keeping translations based on the reference pattern could be good: when merging, when check what references point to protected paths and we never purge those translations. Somehow however, this still feels dirty to me as the logic behind purging, merging and friends is already somewhat convoluted.

Last proposal, directed mainly to @josevalim: when working on this, we decided the way to tell if a translation comes from Elixir or from the user would be the presence of one or more reference #: comments; it was the easiest and most straightforward solution at the time. However, I proposed something else as well: we identify Elixir translations with a elixir flag (in a #, comment, more here). If we do this, we don't have to change anything else I think because JS translations will not have the elixir flag. Definitely a lot more noise in PO/POT files, but probably more straightforward behaviour. Wdyt?

maufl commented 8 years ago

I like your last proposal.

I thought about white-/blacklisting domains or reference paths, and I think it would work if there is only one backend, which is probably the case most of the time. As soon as we have multiple backends, which is a supported case, it's unclear were the configuration for white-/blacklists should go. Especially in the references case, because right now there is no information available at the point when translations are merged, only when pot files a read.

josevalim commented 8 years ago

I am not a fan of the last proposal because of the noise. :( I wouldn't worry about multiple backends because why would you also spread the javascript related translations across backends?

whatyouhide commented 8 years ago

If we go with blacklisting/whitelisting, we can configure it in the :gettext application I think, so that it applies to all backends.

josevalim commented 8 years ago

I am fine with per backend or for the :gettext app.

whatyouhide commented 8 years ago

I proposed to configure it for :gettext because I have a hard time imagining users wanting something blacklisted from a backend, but not from another backend; if Gettext for Elixir is not handling you frontend translations, then it's not a decision at the backend level :). Does it make sense? If so, @maufl wdyt?

maufl commented 8 years ago

Yes, I think that's a good idea. In this case I would go for reference white-/blacklisting, so the user is free to use the same domain in frontend and backend.

I can take a shot at implementing it. :)

whatyouhide commented 8 years ago

@maufl yep, let's go with this for now:

when merging, when check what references point to protected paths and we never purge those translations

Of course, please do take a shot at implementing it, it would be awesome :smiley: Let me know if you find any bump in the road, especially if it's caused by some code being not-so-clear :).

maufl commented 8 years ago

How would you name this configuration option? I'm not sure how to best express the intention.

(protect|exclude)_(refs)?_from_purging

maufl commented 8 years ago

I think this patch might suffice. Can you take a look at the code and let me know if I forgot about something? If the code is OK, I will add tests and documentation and open a pull request (or I can do this now).

There where some bumps :) Sometimes it's not easy to follow the code, but I can not point my finger at any particular problem. Also, I'm quite new to Elixir (~2 months in), so it might just be me not being used to functional programming.

whatyouhide commented 8 years ago

@maufl you can open a PR so I can review the changes if you want. :)

maufl commented 8 years ago

Now I have a test and documentation. I opened a pull request.

josevalim commented 8 years ago

:heart: :green_heart: :blue_heart: :yellow_heart: :purple_heart:

whatyouhide commented 8 years ago

This landed in #86, so closing this.

elixir-gettext / gettext

Msgids are deleted from POT files #85