Signbank / Global-signbank

An online sign dictionary and sign database management system for research purposes. Originally developed by Steve Cassidy. This repo is a fork for the Dutch version, previously called 'NGT-Signbank'.
http://signbank.cls.ru.nl
BSD 3-Clause "New" or "Revised" License

JSON error for /package endpoint #1149

Closed Woseseltops closed 6 months ago

Woseseltops commented 7 months ago

ELAN dev Divya Kanekal reports that API endpoint https://signbank.cls.ru.nl/dictionary/package/?&dataset_name=tstMH doesn't work anymore, and I can confirm... I suspect there is something wrong with an individual gloss


Woseseltops commented 7 months ago

@susanodd , can you investigate?

susanodd commented 7 months ago

Yes, I'm on it.

susanodd commented 7 months ago

It chokes when looking for small videos. (Raises an exception and passes.) But that's probably not the error yet...

It looks like it has a gloss ID that is an integer, not quoted, as a key.

It's on gloss 4611. But modifying the get_gloss_data function so that the id is quoted does not resolve the bug. That is the first gloss it finds, so the function itself seems to be the problem.

What should API_FIELDS be? It is including some fields whose values are empty strings.

susanodd commented 7 months ago

From the nested "parsing" of the data that gets written to output, these are the keys of the info about each gloss:

    dict_keys(['Annotation ID Gloss: Dutch', 'Annotation ID Gloss: English', 'Lemma ID Gloss: Dutch', 'Lemma ID Gloss: English', 'Translations', 'Simultaneous Morphology', 'Sequential Morphology', 'Link'])

But that seems like too many fields.

It gets all the way through the loop; this line goes wrong on the last gloss, with ID 46505:

    output = json.dumps(data, indent=INDENTATION_CHARS)

The input data seems okay (it can be printed). So maybe an end-of-line or something after the data? Will keep looking.
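(For reference on the integer-key suspicion above: json.dumps coerces integer dict keys to strings, so an unquoted ID key by itself would not make the dump fail; the dump only fails on values that are not JSON-serializable. A quick stdlib check:)

```python
import json

# Integer dict keys are coerced to strings, so an "unquoted" gloss ID key
# alone cannot be what makes json.dumps fail:
dumped = json.dumps({4611: "gloss"})
# dumped == '{"4611": "gloss"}'

# json.dumps does fail on values that are not JSON-serializable,
# e.g. a set or a model object:
try:
    json.dumps({"4611": {"video": {1, 2}}})
    failed = False
except TypeError:
    failed = True
```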

susanodd commented 7 months ago

My bad, I found the error. I had expanded the original get_fields_dict to work on a broader selection of fields, but forgot to make sure they were all in the fields argument list.
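For illustration, a minimal sketch of this class of bug (names and data are hypothetical, not the actual Signbank code): a helper that filters on a fields argument silently drops any field the caller forgot to list.

```python
from types import SimpleNamespace

def get_fields_dict(gloss, fields):
    """Collect the non-empty attributes of a gloss, restricted to `fields`."""
    return {f: getattr(gloss, f, '') for f in fields if getattr(gloss, f, '')}

# Toy gloss object standing in for a model instance:
gloss = SimpleNamespace(idgloss='WALRUS', handedness='2s', location='Cheek')

# A stale caller that was never updated silently loses the new fields:
stale = get_fields_dict(gloss, ['idgloss'])
# {'idgloss': 'WALRUS'}

# The fix is simply to pass the expanded selection of fields:
fixed = get_fields_dict(gloss, ['idgloss', 'handedness', 'location'])
# {'idgloss': 'WALRUS', 'handedness': '2s', 'location': 'Cheek'}
```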

susanodd commented 7 months ago

This is live on signbank-dev now.

susanodd commented 7 months ago

This is live on signbank too.

susanodd commented 7 months ago

@Woseseltops there are numerous zip files in the writable/packages folder of signbank.

Can you figure out based on old files what is supposed to be in them? It's the glosses.json file that was causing the problem.

On signbank, there are numerous signbank_package zip files all on March 30 2023. I don't know why there are so many on that day. Perhaps that's when it got broken? There are also numerous signbank_patch zip files on the same day.

Also, isn't this supposed to be a file that gets downloaded, not stored on disk?

Woseseltops commented 7 months ago

On signbank, there are numerous signbank_package zip files all on March 30 2023. I don't know why there are so many on that day. Perhaps that's when it got broken? There are also numerous signbank_patch zip files on the same day.

Perhaps this was the day we moved our old writable folder to a new container?

Also, isn't this supposed to be a file that gets downloaded, not stored on disk?

Yes, but if I remember correctly, the reasoning here was that usage was expected to be low, so it's not worth the time to set up a cleaning schedule.

Can you figure out based on old files what is supposed to be in them?

From the top of my head, a JSON file, and for the rest lots of video files and images. This end point is designed to be called by ELAN, so that videos and images can be watched while annotating. Like you discovered, there's a package option, which gives you everything, and a patch option, which gives you everything changed since a particular date.
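As a sketch of how a client like ELAN might call this (the dataset_name parameter appears earlier in this thread; the name of the date parameter for the patch variant is a guess, not a documented one):

```python
from urllib.parse import urlencode

BASE = "https://signbank.cls.ru.nl/dictionary/package/"

def package_url(dataset_name, since=None):
    """Build a package URL; with `since`, ask only for changes after that date."""
    params = {"dataset_name": dataset_name}
    if since is not None:
        params["since"] = since  # hypothetical parameter name for the patch option
    return BASE + "?" + urlencode(params)

full = package_url("NGT")
# 'https://signbank.cls.ru.nl/dictionary/package/?dataset_name=NGT'
patch = package_url("NGT", since="2023-03-30")
```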

susanodd commented 7 months ago

Do you recall if there should be anything else in glosses.json besides the idgloss ?

There is a setting that is used:

    API_FIELDS = ['idgloss']

But the field idgloss predates lemmas and datasets. There used to be a gloss field idgloss, which eventually became a property instead and is a lengthy lookup.

Since there is a new issue about API maybe this needs to be refined.

susanodd commented 7 months ago

@Woseseltops on March 30, 2023, it used to have this info in the glosses.json file:

    "43861": {
        "Annotation ID Gloss: Dutch": "een-locatie-een-actie",
        "Annotation ID Gloss: English": "one-location-one-action",
        "Translations": "",
        "Morphemes": "",
        "Parent glosses": "",
        "Link": "https://signbank.cls.ru.nl//dictionary/gloss/43861"
    }

I'll restore it to match that. The Translations are now Senses though, with separate languages. How to relay that to users of the functionality?

NOTE: For the above, for ALL of NGT, none of the glosses had a non-empty "Parent glosses" field. Likewise, NONE of the glosses had a non-empty "Morphemes" field.

susanodd commented 7 months ago

@Woseseltops I've revised it as so:


    "44314": {
        "Annotation ID Gloss: Dutch": "BRUS-A",
        "Annotation ID Gloss: English": "SIBLING-A",
        "Translations": "vriend, somebody at the same level, brother, friend, brother, zus, iemand dichtbij je, vriendin, sister",
        "Link": "http://localhost:8000//dictionary/gloss/44314"
    },

The morphemes and parent glosses are no longer implemented the same way. Here, all the senses appear in Translations, with languages mixed together. (This example is from localhost.)

susanodd commented 7 months ago

@Woseseltops What does the user Divya Kanekal want to be included in it?

susanodd commented 7 months ago

The vintage API fields from 2023 have been restored. This has been deployed to signbank.

rem0g commented 7 months ago

When using the API endpoint, login is required.

What are options to collect the data from our platform, for automated purposes?

susanodd commented 7 months ago

I think I can modify it so it can be without login. It's probably set in the urls.

As it is now, it retrieves all glosses.

If you've done querying in signbank, you will have noticed the extremely long url with all the parameters shown.

If you've used the Query View, and stored any queries, then it uses a shorthand "?query" in the url.

Probably it would be possible to use this sort of approach to create queries via (an enhanced) API interface and save them and then use them this way. Then all the parameters would not need to be repeatedly sent.

Can you give an example of a type of query/operation you would like to do?

Ideally you do not want to be communicating huge amounts of data back and forth as json. The use of stored queries in Signbank makes this more efficient. So abstractly, you could make up a set of common types of queries and have them as operations/abstractions/commands at a higher level.
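A toy sketch of the stored-query idea (helper names and data are made up, not Signbank's implementation): parameters are saved once under a short name and replayed later, so they need not be resent with each request.

```python
SAVED_QUERIES = {}

def save_query(name, params):
    """Store a query-parameters dict under a short name."""
    SAVED_QUERIES[name] = dict(params)

def run_query(name, glosses):
    """Replay a saved query against a list of gloss dicts."""
    params = SAVED_QUERIES[name]
    return [g for g in glosses if all(g.get(k) == v for k, v in params.items())]

save_query("find_animals", {"Semantic Field": "Animal"})
glosses = [
    {"Annotation ID Gloss: English": "WALRUS", "Semantic Field": "Animal"},
    {"Annotation ID Gloss: English": "HOUSE", "Semantic Field": ""},
]
animals = run_query("find_animals", glosses)
# only WALRUS matches
```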

susanodd commented 7 months ago

Also, if you've looked at the Minimal Pairs view (under the Analysis menu), those are all computed dynamically. A very complicated query analyses this. (The code is visible in the implementation.) It's run as a bunch of ajax queries per gloss because otherwise it takes way too much time to compute. In the Gloss List view, the rows are individual ajax calls.

susanodd commented 7 months ago

The queries are query parameters dictionary objects. Probably a kind of "machine" login account could be made and this meta-user could create query objects and execute queries this way. Like a virtual machine. All of the database operations exist as urls. You can check this out in the urls.py files. The idea of the query objects is to move all the parameters into the query dict object rather than being communicated in forms with get and post.

susanodd commented 7 months ago

I moved all of the Oefen animal glosses to NGT earlier (by hand). To facilitate this, I added a Semantic Field tag Animal to each gloss. (With only dataset Oefen selected). To move the glosses both Oefen and NGT had to be selected. It was straightforward to find all the animals using the Semantic Field Animal. I would recommend this for the other sets of glosses you guys have made for Health Care. (If needed you can create new semantic fields in the admin.) This will facilitate querying them later since the API could reduce the amount of glosses retrieved more easily that way.

susanodd commented 7 months ago

In the Query View (if you have just done a Search, then go to the pulldown Analysis -> Query View, you see the same results, but showing the columns corresponding to the parameters of your query, and can toggle the columns to focus on the interesting fields). Then SAVE your query. Then go to Search History. Here you can give your query a name.

So for example, if you query on Semantic Field Animal and then save your query as "find_animals", you can execute your search as ?query=find_animals. Although at the moment, you need to use the query ID rather than find_animals.

(Only your own queries are visible to yourself.)

This is the gist of an abstract machine for querying. The API could offer commands this way. Then give the output as json. (The ajax call urls give json. See the urls.py files.)

susanodd commented 7 months ago

When using the API endpoint, login is required.

What are options to collect the data from our platform, for automated purposes?

Yes, the function that is called by the url is prefaced with login_required

It should work to just remove that. Since NGT is public, it should be okay to obtain the public glosses that way without logging in. But only as read-only.

I can fix that on Monday.

The package function will then check if the request comes from a logged in user and if not, it only allows NGT. (So same as Guest users.)

The tstMH test dataset @Woseseltops showed above in the errors is not public because it contains nonsense signs.
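In outline, the access rule described here could look like this (a plain-Python sketch of the check, not the actual Django view code):

```python
PUBLIC_DATASETS = {"NGT"}  # tstMH stays private: it contains nonsense signs

def dataset_allowed(dataset_name, is_authenticated):
    """Read-only access check: anonymous users only get public datasets,
    the same as Guest users."""
    return is_authenticated or dataset_name in PUBLIC_DATASETS
```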

susanodd commented 7 months ago

I could make a dummy "query machine interface" so you could try requesting package things like the animals query mentioned above. If you check the gloss fields returned above, see if that's sufficient for developing on your side. I need to repair it so it offers Senses per language instead of Translations.

susanodd commented 7 months ago

For the glosses you create on your side in NGT (including the moved animal glosses), you will need to set the inWeb to be True, or confirm it is. (That's easiest with an update gloss CSV.) We can only show the glosses explicitly marked as inWeb as public.

susanodd commented 7 months ago

About the Query saving, you can also save fields you searched on like creation date and created by user. Just add things to your glosses to be able to query them explicitly. (There are many Semantics fields to exploit. Also for part of speech.)

If you want to be able to add notes about discussions for signs on your end, that could be done with a cron job type operation at the end of the day, if you save the changes locally. (That cron job would need to login to update.)

To see notes and things, they would need to be public / published notes. Then an additional field could be added to the glosses json output. Plus maybe a flag to retrieve advanced features in the package interface. Tags can be included in the glosses json.

susanodd commented 7 months ago

At the moment, anonymous users do not see the Analysis menu where the queries are available.

I could make it so that if the user is not logged in, there is a pre-defined set of queries available. So the page would look just like buttons you could press for various frequent kinds of queries.

These could then also be available via the API. (Like the find_animals example, but then available to everybody.) The corresponding urls would yield json.

(An obvious caveat is that the glosses need to have semantic fields, word class, etc added to them.)

The intention of the Semantic Fields is actually to make this possible. For datasets with languages that are not in the interface choices, you can also add semantic field translations in those languages. So you could have buttons in Japanese or French.

susanodd commented 6 months ago

@rem0g are you making some kind of intersystem reference table? Signbank ID related to an ID in your system?

I've been thinking about this the past few days, wondering what it is you want to use the json info to build. The image and video links are included in json files, plus the glosses info.

But without being logged in, the glosses need to be inWeb and in either NGT or in a public dataset. The information is basically what is visible in the Public View of a gloss. But say the semantic fields and published notes can also be visible.

There is a parameter for a date/time on the above url. But this only retrieves newly created glosses, not recently updated glosses.

For each gloss a Revision History exists. You can see this in the Gloss panels in Gloss View. So if necessary, we could look in the revision history to retrieve recently updated glosses. (The revision history also stores what user did the changes and what the change was.)

(This is why a login is needed if you want to update things.) The user name is in the revision history shown, not any email. So possibly this could be retrieved in json form.

At the moment, the revision history is not anything executable. (In contrast to the query parameters.) The class type for revision history could be modified if you need this kind of information on your end.

It looks like the tags can be included in the public json. They don't really contain any sneaky information. These are easy to update via csv.

susanodd commented 6 months ago

This is live now.

susanodd commented 6 months ago

When using the API endpoint, login is required.

What are options to collect the data from our platform, for automated purposes?

This is live now. You can obtain the NGT public glosses via the package command without being logged in.

Let me know if the fields offered for glosses are okay. I'll fix the Translations to convert to Senses. (Still to do.)

susanodd commented 6 months ago

@rem0g the glosses you created that were originally in Oefen, the Animals, they are now in NGT. But those glosses are not public, so they are not visible to anonymous users.

susanodd commented 6 months ago

Not pushed yet, but this is the json revision for Senses instead of Translations lumped together:

    "44314": {
        "Annotation ID Gloss: Dutch": "BRUS-A",
        "Annotation ID Gloss: English": "SIBLING-A",
        "Senses: Dutch": {
            "1": "iemand dichtbij je",
            "3": "vriend, vriendin",
            "5": "zus"
        },
        "Senses: English": {
            "1": "somebody at the same level",
            "2": "brother, sister",
            "3": "friend",
            "4": "brother"
        },
        "Link": "http://localhost:8000//dictionary/gloss/44314"
    },

susanodd commented 6 months ago

Here's another with the default fields of Gloss List included in the glosses.json

    "45869": {
        "Annotation ID Gloss: Dutch": "WALRUS",
        "Annotation ID Gloss: English": "WALRUS",
        "Translations": "walrus, walrus",
        "Senses: Dutch": {
            "1": "walrus"
        },
        "Senses: English": {
            "1": "walrus"
        },
        "Handedness": "2s",
        "Strong Hand": "C",
        "Weak Hand": "C",
        "Location": "Cheek",
        "Link": "http://localhost:8000//dictionary/gloss/45869"
    },

susanodd commented 6 months ago

@rem0g would you like the json to be multilingual, in Dutch? I can add a parameter. The keys are static in English so far. For CSV the headers are in English.

Jetske commented 6 months ago

Here's another with the default fields of Gloss List included in the glosses.json

    "45869": {
        "Annotation ID Gloss: Dutch": "WALRUS",
        "Annotation ID Gloss: English": "WALRUS",
        "Translations": "walrus, walrus",
        "Senses: Dutch": {
            "1": "walrus"
        },
        "Senses: English": {
            "1": "walrus"
        },
        "Handedness": "2s",
        "Strong Hand": "C",
        "Weak Hand": "C",
        "Location": "Cheek",
        "Link": "http://localhost:8000//dictionary/gloss/45869"
    },

@susanodd I contacted Divya and these fields cover all necessary fields for ELAN.

susanodd commented 6 months ago

Okay, that's good to know!

rem0g commented 6 months ago

@susanodd would it be possible to retrieve JSON output (so not in a zipped file) from one gloss only?

For example: https://signbank.cls.ru.nl/dictionary/package/?&dataset_name=NGT/?&gloss=ELEPHANT and https://signbank.cls.ru.nl/dictionary/package/?&dataset_name=NGT/?&gloss-id=4681

The output should be as complete as possible, so I can for example show the fields in Gebarenoverleg Platform and show the video/image to our users.

We should also consider updating data for those glosses and having revision control for that (when possible), but I can also back up the original data first on GOP (Gebarenoverleg Platform) and revise it when needed for our users.

susanodd commented 6 months ago

We have this kind of url:

/dictionary/ajax/glossrow/180/

(where the number is a gloss ID)

I see though that it has already put the json into a template.

(The json functionality exists behind the scenes.)

I'll make such a url to retrieve a particular gloss as json as you describe.

susanodd commented 6 months ago

This one does retrieve json; it's for gloss completion, where you enter characters the gloss begins with (here, "to").

https://signbank.cls.ru.nl/dictionary/ajax/gloss/to

(Yes, it's not a match. I was checking to see what we already have.)

susanodd commented 6 months ago

We need to come up with a different url name than the package url, because the zip output is used by researchers at MPI for ELAN.

rem0g commented 6 months ago

Can this endpoint also be used for adding new glosses and upload new videos (with video URL or blob format?)

susanodd commented 6 months ago

I'll see what we have. For adding glosses or uploading video a login would be required. Or some kind of token. It may be possible to include the IP of your machine (server) in the allowed machines for our server. (Or something akin to that.) There is a command for uploading a bunch of videos from a folder. (For if the glosses already exist.) When we upgraded Django some of the "multiple files" forms we used to have were no longer allowed by Django because of security issues.

There's still a command url, /dictionary/import_videos/, that uses the setting settings.VIDEOS_TO_IMPORT_FOLDER.

(The command looks for videos in that location.)

Inside that folder, the structure is dataset(name or acronym)/language(3char-abbreviation)/annotationidgloss.mp4

/ASL/eng/HOUSE.mp4

To figure out where the video should be put, it is looked up as:

    glosses = Gloss.objects.filter(lemma__dataset=dataset, annotationidglosstranslation__language=language,
                                   annotationidglosstranslation__text__exact=filename_without_extension)

So conceivably, you could upload a bunch of videos to the particular folder. (Somehow.) Then run the command. (Perhaps as a cron job.) I need to see how that works. This was a command @ocrasborn would use to exchange videos with other researchers.
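The folder structure above maps straight onto the lookup keys; a small sketch of the parsing step (hypothetical helper, not the actual import command):

```python
from pathlib import Path

def parse_import_path(relative_path):
    """Split a VIDEOS_TO_IMPORT_FOLDER-relative path like 'ASL/eng/HOUSE.mp4'
    into (dataset, language, annotation text without extension)."""
    p = Path(relative_path)
    dataset, language = p.parts[0], p.parts[1]
    return dataset, language, p.stem  # stem == filename_without_extension

parsed = parse_import_path("ASL/eng/HOUSE.mp4")
# ('ASL', 'eng', 'HOUSE')
```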

susanodd commented 6 months ago

It's probably possible to allow a Dataset Manager to upload a zipped file of videos on the Manage Dataset page. And then via a cron job unzip this into the settings.VIDEOS_TO_IMPORT_FOLDER location, then proceed as above with the command. I'm not sure if an unzip command would be wise in the interface, but a cron job sounds okay. But it would need to inspect that all the files are indeed video files.

If you are somehow able to collect the "updates" on your side, you could show users the local copy until it can be uploaded.

[Unzip is currently not installed on the server. But some file system commands are done by Signbank, like creating folders, etc. Signbank runs inside a container. It used to run on Django 1.11 and university computing threatened to take it down if we didn't upgrade it. I used to have an apache server on my desktop computer but they disabled my computer's IP until apache was completely removed. Now we have several Signbank containers running for development work.]

[That is why we are cautious about how updates are done....]

susanodd commented 6 months ago

I'm making the gloss retrieval url so that it is graceful: if the dataset id does not exist, or the gloss id does not exist in the dataset, it just returns {}.

Oh, I forgot one case: if the dataset id does not exist or the user is anonymous, it defaults to NGT and then looks up the gloss. If that does not succeed, it returns {}.

Otherwise, it returns the gloss info as json.
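In outline, the fallback behaviour reads like this (a plain-Python sketch over toy data, not the actual view; 5 stands in for the NGT dataset ID):

```python
NGT_ID = 5  # dataset ID for NGT

def get_gloss_json(dataset_id, gloss_id, datasets, is_anonymous):
    """Return the gloss info dict, or {} when any lookup fails."""
    if dataset_id not in datasets or is_anonymous:
        dataset_id = NGT_ID  # default to NGT
    return datasets.get(dataset_id, {}).get(gloss_id, {})

# Toy stand-in for the database:
datasets = {5: {4681: {"Annotation ID Gloss: English": "ELEPHANT"}}}

found = get_gloss_json(5, 4681, datasets, is_anonymous=True)
missing = get_gloss_json(99, 123, datasets, is_anonymous=False)
# missing == {}: unknown dataset falls back to NGT, where gloss 123 is not found
```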

susanodd commented 6 months ago

@rem0g Here is the first version:

https://signbank.cls.ru.nl/dictionary/get_gloss_data/5/4681/

That's elephant.

Because it's public, you can also view it anonymously, but you see fewer fields.

The "5" is the dataset ID for NGT.

It doesn't take any parameters other than those in the url. As mentioned above, if the ids are messed up, it returns nothing.

susanodd commented 6 months ago

It only shows direct fields of the gloss, not relations or morphology, nor Notes or Tags. (That requires more work because it involves related objects.) Only the non-empty fields are shown. Plus the Booleans, since False is also a value.

susanodd commented 6 months ago

@rem0g I'll make another url for retrieving the set of fields that are available. The above url yields the fields that are not empty. There are many many fields for a gloss, but most are empty. The json would contain lots of empty fields otherwise.

You can see a preview of the fields in the Import CSV Update Gloss template: The really long scroll bar. That also shows the (dynamic) choices for the fields. (Except for Tags.) I can make additional urls to return the actual choice lists for each field. (These are multilingual. But CSV import uses the English ones. That's required for all datasets.)

Since in #1152 you are adding meta information, probably this is most interesting, the Tags and Notes. (That requires a bit more programming because the gloss that is retrieved also refers to other objects.) I'll work on that next.

If you need additional urls for retrieving specific collections of information, I can make that too.

susanodd commented 6 months ago

I'll see from @Woseseltops if in regards to #1152 we can have it so some urls are meant for your server. So like a limited scope that your server (only) has permission to do, for updates.

The CSV routines have several stages. After debugging syntax and checking for errors, it shows a list (table) of the proposed changes that will be carried out. That lives inside the template. I can abstract this and make urls that follow that pattern of (implicit) commands and make those available for doing the updates. But there would still need to be two steps, so probably two urls, where the first does the syntax and consistency checks and the second performs the updates. Doing so would also improve the existing CSV code. (This code is extremely lengthy and long-winded and difficult to follow. It has been in use for more than 10 years and has undergone numerous migrations and system reworks. Kind of like a vintage car.)

susanodd commented 6 months ago

@rem0g I added this url to obtain the fields. These are the domain of the gloss data url, but this shows the entire list. (With the exception of relations/related objects. TO DO.)

https://signbank.cls.ru.nl/dictionary/get_fields_data/5/

(The dataset is needed because the translation languages are used in the annotation and senses fields.)

Again, if you are not logged in, you see fewer. Hopefully this will make it easier to use both urls on your side, since you can tell which ones are empty this way.

susanodd commented 6 months ago

I'll make another that shows the choices for the fields that have choices. The fields above are basically those included in the CSV glosses export.

rem0g commented 6 months ago

I'll see what we have. For adding glosses or uploading video a login would be required. Or some kind of token. It may be possible to include the IP of your machine (server) in the allowed machines for our server. (Or something akin to that.) There is a command for uploading a bunch of videos from a folder. (For if the glosses already exist.) When we upgraded Django some of the "multiple files" forms we used to have were no longer allowed by Django because of security issues.

There's still a command url: /dictionary/import_videos/ That uses the setting settings.VIDEOS_TO_IMPORT_FOLDER

(The command looks for videos in that location.)

Inside that folder, the structure is dataset(name or acronym)/language(3char-abbreviation)/annotationidgloss.mp4

/ASL/eng/HOUSE.mp4

To figure out where the video should be put, it is looked up as:

                glosses = Gloss.objects.filter(lemma__dataset=dataset, annotationidglosstranslation__language=language,
                                             annotationidglosstranslation__text__exact=filename_without_extension)

So conceivably, you could upload a bunch of videos to the particular folder. (Somehow.) Then run the command. (Perhaps as a cron job.) I need to see how that works. This was a command @ocrasborn would use to exchange videos with other researchers.

Ok, that would be a good start. With nodejs I can batch upload videos with a delay between them to keep the server load light. Is the url usable now? I have tried https://signbank.cls.ru.nl/dictionary/import_videos/ASL/eng/HOUSE.mp4 but it doesn't work. We have some videos to upload soon.

susanodd commented 6 months ago

Okay, I'll investigate. We haven't used the command since @ocrasborn used it.

@Jetske is developing tests for video upload. So we can include other urls in the tests to make sure they work. Primarily for different video formats that were causing bugs with converting to mp4 and extracting an image for the gloss.

The tests will also check that making a backup of the previous video works as expected. That is definitely needed as you actively add more videos via a url. (Backups are made but the naming convention of the backed up files sometimes creates way too many extensions. That is invisible to the end user, but the file names and files are all stored, also ones during repeated upload attempts.)

In the Gloss Edit View, if you upload a video and there is already a video (even one that did not succeed), there may be an entry in the GlossVideo table that refers to a file. So sometimes it is necessary to do a "delete" (of something that is not visible in the display because, say, it had a format that can't be displayed in your browser), so it is (mysteriously) necessary to do a delete before a new upload will succeed. You may have noticed there are quite a few issues about uploading videos. That's what the tests will help debugging the code.