Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Annotation workflow v2 #19

Closed: josh-chamberlain closed this issue 6 months ago

josh-chamberlain commented 1 year ago

Context

v1 Existing doccano instance

Our volunteer @nfmcclure made us a doccano instance, which helped us label hundreds of URLs. This is an update to that original code, or a fresh start if needed. We need to label more data sources, and the next version of the pipeline needs to be more user-friendly, answering as many volunteer questions up front as possible.

Doccano instance: http://35.90.222.49:8000/projects/1 Comment or DM for access.

Doccano alternatives

Since we're starting fresh, we should probably use something like labelstudio. It's more fully featured and supports labeling rendered HTML, not just plain text. This could really help us label a wide variety of things.

Requirements

Docs

mbodeantor commented 1 year ago

We should probably own user creation so we have some visibility into who's making changes.

nfmcclure commented 1 year ago

I'll try to find some time soon (probably this week or next) to help migrate the instance over to digital ocean.

josh-chamberlain commented 1 year ago

thanks, @nfmcclure ! Much appreciated. Our pattern for other apps is to have a github repo on our org, set to auto-deploy over on DO. I'm guessing you won't have the permissions you need, and PDAP staff are happy to pick up where you get stuck.

josh-chamberlain commented 11 months ago

@nfmcclure a nudge, any update on the source code? we're happy to do any migration work, if that would help.

maxachis commented 8 months ago

@josh-chamberlain I'm willing to help tackle this. There are a few things I'll need in order to fully complete the task.

maxachis commented 8 months ago

As I understand it, this annotation workflow will include several components:

maxachis commented 8 months ago

Began working on Label Studio. Here is a very rough version of what the labeling process could look like. I gave it the URL and the HTML content; it rendered that HTML content and displayed it along with the URL. I then gave a few sample options (not referencing the Taxonomy) to indicate how these choices might play out. Depending on the number of options, I imagine it would be trivial to configure the option selection differently.

Label Studio can be configured to accept Source (Input) and Target (Output) Databases. Unfortunately, the non-local file options available for both are rather limited. I would need to know what our preferred method of storage would be, and what information I would need in order to properly load preprocessed data into the Source Database


The process for setting up Label Studio on Digital Ocean seems fairly simple.

Additional, lower priority tasks include:

My Next Task

Current Blockers (cc: @josh-chamberlain)

maxachis commented 8 months ago

#38 is a draft pull request for the preprocessing pipeline, which would be included in the data source identification repository. Being separate from the rest of the logic in that repository, it could be easily moved elsewhere. However, because it's associated with this issue within the repo, I'm keeping it here for now.

The pipeline is relatively simple -- it loads the data from the relevant source, harvests the HTML for each URL and pairs it with that URL, and then uploads that data to the relevant target. The source and target classes are designed to be easily substituted for the actual classes; they currently exist as placeholders and to guide the final implementation.
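To make that concrete, here is a minimal sketch of the shape of that pipeline. The SourceDatabase and TargetDatabase classes below are hypothetical placeholders standing in for whatever storage we choose (per the blockers above); they are not the actual classes in the draft PR:

import requests

class SourceDatabase:
    """Hypothetical placeholder for wherever the candidate URLs live."""
    def get_urls(self) -> list[str]:
        raise NotImplementedError

class TargetDatabase:
    """Hypothetical placeholder for wherever the (URL, HTML) pairs should land."""
    def upload(self, records: list[dict]) -> None:
        raise NotImplementedError

def harvest_html(url: str) -> str:
    # Fetch the raw HTML for a single URL.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def run_pipeline(source: SourceDatabase, target: TargetDatabase) -> None:
    # Load URLs from the source, pair each with its HTML, and upload to the target.
    records = []
    for url in source.get_urls():
        try:
            records.append({"url": url, "html": harvest_html(url)})
        except requests.RequestException:
            continue  # skip unreachable URLs; a real pipeline would log these
    target.upload(records)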

Once I obtain the information about the source and target databases discussed in the blockers above, I will be able to complete this and submit it as a full pull request.

My Next Tasks

josh-chamberlain commented 8 months ago

@maxachis yes, setting up in DigitalOcean should be pretty simple! In general we'll get things merged into a GitHub repo and working locally across our different machines, then Marty will push to a DO droplet.

I think it's up to @mbodeantor how to set up the target and source databases—I'm open to discussing this tomorrow if needed.

mbodeantor commented 8 months ago

@maxachis @josh-chamberlain Yeah, I would like to discuss; I'm unclear on the use case for Label Studio vs. Doccano.

josh-chamberlain commented 8 months ago

@mbodeantor either way, we'll need to start from scratch—we can't get hold of the source code for Doccano, though from what I understand we had a pretty vanilla implementation.

The key difference is that Labelstudio allows for labeling richer content like embedded images / rendered html, whereas Doccano is text-only. Labelstudio also seems to have a more mainstream user base, better documentation, etc.

maxachis commented 8 months ago

label_studio_config.zip: This file contains the HTML for the Expert Instruction, a JSON for the data types used, and JSON and XML template versions of the label config. In theory, this should be the primary material we need to quickly set up Label Studio on Digital Ocean.

josh-chamberlain commented 8 months ago

@maxachis nice, let's wait for @mbodeantor to make sure he's on board with using Label Studio.

maxachis commented 8 months ago

Currently working on setting up a Droplet on Digital Ocean. Outstanding tasks:

maxachis commented 8 months ago

A Label Studio instance is online, of sorts. It can be accessed at http://167.71.177.131:8080

Hitting a possible blocker here in terms of Role-Based Access Control (RBAC) -- specifically, the free version of Label Studio doesn't have it. That means that anyone who accesses our Label Studio instance can, if they are unscrupulous, change any settings they want in the project.

Now, if we're comfortable with the honor system, that's not a problem. If we're not, that would require some restructuring.

The enterprise version of Label Studio would avert this problem, and would allow other functionality as well. However, the pricing for Enterprise is unknown -- we'd have to contact sales about it.

@josh-chamberlain My current question is as follows:

josh-chamberlain commented 8 months ago

@maxachis interesting. I reached out to them for pricing info.

Can we limit access to the entire instance, to at least control who can sign up? Allowing anyone to sign up is scary. If we have some control over who can sign up, that's different. If people have to come to us to get an account, I am comfortable with the honor system for now—provided we periodically extract labeled content in case someone messes up / sabotages us. This worked just fine with Doccano.

I'm strongly against creating our own annotation pipeline from scratch. We can find a different free/cheap one which has RBAC if necessary...we are not the first to face this problem.

maxachis commented 8 months ago

There might be ways to limit access to the entire instance, but that would probably require us to implement things that go beyond the capabilities of free Label Studio. One option could be dockerizing our particular configuration of Label Studio and then handing that out to volunteers. If we configure it to point to cloud target and source databases, then in theory they would just need to spin it up and get started. But this wouldn't stop them from sharing it around or enable us to track their activity. And it does add an extra step (or several) to getting someone to help with annotation.

My personal opinion is that if we're willing to go to such lengths, then considering another annotation tool would probably be a more effective use of the effort.

maxachis commented 8 months ago

Additionally, in case this affects the decision on whether to use Label Studio, I will point out that the feature of displaying HTML would probably run into some problems if we tried to display the HTML of the web pages we're looking at. Since the HTML of these pages sometimes relies on relative addressing, rendered HTML content, absent the context of the web server it comes from, might appear broken. Not always, but sometimes.

josh-chamberlain commented 8 months ago

@maxachis let's keep it simple and just try to annotate the text then: the URL plus the meta and header content being collected by the tag collector. We can save displaying the page content for a future enhancement.
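For illustration, a single task built from that text-only content might look something like the dict below; the field names ("url", "html_title", "meta_description", "h1") are assumptions for the sketch, not the tag collector's actual schema:

example_task = {
    "data": {
        "url": "https://example-county.gov/police/records",          # the URL itself
        "html_title": "Public Records Request | Example County PD",  # from the <title> tag
        "meta_description": "Request incident and arrest records.",  # from a meta tag
        "h1": "Public Records",                                       # first header on the page
    }
}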

josh-chamberlain commented 8 months ago

regarding auth/users, I do have a call with labelstudio tomorrow to find out about enterprise pricing. I suggest:

josh-chamberlain commented 7 months ago

We have a labelstudio 2 week trial. Some things I'd like to test:

maxachis commented 7 months ago
  • [ ] annotation experience from perspective of different roles

@josh-chamberlain What is the full suite of roles we're envisioning here? Based on the interface, I can see two right off the bat:

  1. An individual assigning labels without any pre-existing labels
  2. Reviewers accepting/rejecting labels that already exist.

Any others I might be missing?

maxachis commented 7 months ago
  • [ ] assignment + correction process for annotations

I've created a simple URL taxonomy labeling task based on my original design, which anyone can try out easily enough. This is accessible on the project website as "URL Labeling and Annotation".

  • [ ] instead of labeling from scratch, use ML-generated suggestions for users to accept/reject

The relevant documentation on pre-generated predictions can be found here. The video example provided shows someone modifying a JSON file with a pre-existing prediction, which is then displayed in the annotation task as a pre-selected option. It does not show someone being able to accept or reject an option as though the annotation task has already been performed.

Thus, I have a few questions which I will investigate but which are also worth asking the Label Studio team during the free trial check-in.

I will note that if the answer to 1 is that we can't skip directly from the annotation task to the review/accept/reject portion, we could nonetheless build a workaround -- for example, by displaying a URL, the predicted classification, and a binary Approve/Reject option. And, if I'm understanding things correctly, we can bypass the review process so that these pseudo-review annotations are not reviewed a second time. This would allow us to create a workaround for 2 as well -- in this case, the pre-annotated data is treated as contextual information, and the label is the manual "Approve/Reject".
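As a sketch of that workaround, the labeling config and a task might look roughly like the following; the field and control names here ("url", "predicted_label", "verdict") are assumptions for illustration, not a finalized design:

pseudo_review_config = """
<View>
  <Text name="url" value="$url"/>
  <Text name="predicted_label" value="$predicted_label"/>
  <Choices name="verdict" toName="url" choice="single">
    <Choice value="Approve"/>
    <Choice value="Reject"/>
  </Choices>
</View>
"""

pseudo_review_task = {
    "data": {
        "url": "https://example-county.gov/police/daily-blotter",
        "predicted_label": "Incident Reports",  # prediction shown only as context
    }
}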

maxachis commented 7 months ago

Another point of interest is being able to integrate a machine learning backend with Label Studio and create an automated active learning loop. This could synchronize well with #41, "Make training happen on digital ocean". I would need to investigate the implementation further, however. And I may benefit from @EvilDrPurple's and @mbodeantor's insight into the machine learning pipeline and how easily we could integrate it into a Digital Ocean/Label Studio union.

I'm currently playing with their ML Loop example, and y'all can follow along with my forked version of the repo here if you're curious.

josh-chamberlain commented 7 months ago

@maxachis re: roles, 3. someone getting data into/out of label studio, or otherwise integrating with hugging face or the API

maxachis commented 7 months ago

Findings on Annotator and Reviewer Roles

cc: @josh-chamberlain

Administrators can create a project and indicate:

• that annotations are assigned manually, OR
• that annotations are automatically assigned to whomever next performs an annotation.

Assigning manual reviewers is fairly easy and intuitive. If I log in as an annotator, I only see the projects I'm added to. Thus, for annotations, several steps need to be completed for access:

  1. Project needs to be created (obviously)
  2. Project needs to be in a non-Sandbox workspace (somewhat less obviously)
  3. Project, after being created and moved into a non-Sandbox workspace, must be published (even less obvious)
  4. The user needs to be invited to the space and assigned as annotator
  5. The user needs to be added to the given project
  6. If the project involves manual annotations, the user needs to be manually assigned those annotations.

User experience as an Annotator is very user-friendly. Simple as click and go.


• Annotators can submit or skip.
• Annotators can revisit tasks they previously performed.

The experience is similar for reviewers. However, depending on the settings, reviewers can also annotate, so pay attention to the settings.

I’d additionally note that the project dashboard provides useful information, such as how long it takes people to complete a task. I recommend looking at that more closely.

maxachis commented 7 months ago

@maxachis re: roles, 3. someone getting data into/out of label studio, or otherwise integrating with hugging face or the API

I'll look into this next. As I said before, there is the option for machine learning integration, but there also appear to be simpler options that can involve either manual import/export of data, or else hooking it up to cloud-based storage options such as Amazon S3.

maxachis commented 7 months ago

Note that the cloud storage options available (for both the Source and Target databases) are limited to:

maxachis commented 7 months ago

I've been able to set up a source data pipeline that can automatically pull in data for a particular project. A few observations:

maxachis commented 7 months ago

I'll additionally point out that Label Studio has an API which seems like it could be useful, albeit with some limitations:

https://labelstud.io/api

This might make components such as setting up users, linking to specific projects, and so forth easier.

UPDATE: Removed portion expressing uncertainty about whether we can directly assign roles to users via the API -- I have confirmed that we can.

mbodeantor commented 7 months ago

Looks like we can import data through the API: https://labelstud.io/api#tag/Import/operation/api_projects_import_create

maxachis commented 7 months ago

We can also export the data similarly through the API. These would probably be the better options to take, as opposed to hooking them up to cloud providers; it helps keep things more flexible.
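For reference, a minimal export call might look like the sketch below, assuming the project export endpoint with an exportType=JSON query parameter (per the API docs) and placeholder values for the token and project ID:

import requests

API_KEY = "your_api_key"                  # per-user token from the account settings page
BASE_URL = "http://167.71.177.131:8080"   # our Label Studio instance
PROJECT_ID = 1                            # placeholder project ID

# Pull all completed annotations for a project as JSON.
response = requests.get(
    f"{BASE_URL}/api/projects/{PROJECT_ID}/export",
    headers={"Authorization": f"Token {API_KEY}"},
    params={"exportType": "JSON"},
)
response.raise_for_status()
tasks = response.json()
print(f"Exported {len(tasks)} tasks")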

maxachis commented 7 months ago

@josh-chamberlain @mbodeantor I have created a draft pull request at #47 that can serve as a proof-of-concept for demonstrating how to transfer data into and out of the project, utilizing the API. If we wish to go forward with Label Studio, this can be used as a starting point for further modifications.

I'll next work on modifying the data to test importing pre-annotated data, to simulate what could be done with a machine learning pipeline.

maxachis commented 7 months ago

Observations from trial

  1. Label Studio's user interface is quite intuitive and easy to use. For the annotator/reviewer, I anticipate the process would be quite smooth.
  2. The API is quite powerful, capable of migrating data in and out of Label Studio, updating user roles, and a whole host of other actions. The documentation, similarly, is (mostly) quite clear and useful. However, many of the API actions are atomic -- more complex actions, such as rotating user roles to ensure we stay below our seat limit, will require substantial backend work to chain together multiple API actions.
  3. Configuration of the projects and tasks is the primary pain point -- the XML templates for setting up a project are useful, but not optimally documented, and errors are not always clear. Similarly, data must be precisely configured, as errors in the data can cause errors in the application which are not always easy to identify and diagnose. This issue will be most present during task setup but seemingly won't cause too many issues afterwards. We do, however, need to be certain that the task is properly configured and all task data going into it is properly set up.

maxachis commented 7 months ago

On the active learning functionality and deeper ML integration

It is interesting, and I think we could benefit from utilizing it. By selecting only the samples our machine learning model is most uncertain about, we could solve the problem we've had of certain training data being underrepresented. However, that process would require a more complicated setup, and probably would benefit from having an active learning setup already developed. Thus, it might not be useful to explore right now, given the limited amount of time we have on this trial.

Additionally, the documentation for the machine learning portion is lacking and in some cases appears to contain contradictions. For example: model.py in the GitHub repository for the Label Studio ML Backend defines two methods, fit() and predict(), whose parameters are not the same as those in the dummy model example, even though the dummy model indicates that the class defined in model.py is its parent class. Since these two methods are apparently the means by which the machine learning backend would interface with Label Studio, it concerns me that I don't know which of these two sources to treat as authoritative.
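For context, the rough shape of such a backend is a subclass of LabelStudioMLBase that overrides those two methods. The sketch below loosely follows the dummy model's style, with the caveat that the exact signatures are precisely what's unclear between the two sources; classify() and the control/tag names are hypothetical stand-ins for our actual model and labeling config:

from label_studio_ml.model import LabelStudioMLBase

class URLClassifierBackend(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        # Return one prediction per task, in Label Studio's result format.
        predictions = []
        for task in tasks:
            label, score = self.classify(task["data"]["url"])
            predictions.append({
                "result": [{
                    "from_name": "label",  # must match the Choices name in the labeling config
                    "to_name": "url",      # must match the Text tag name in the labeling config
                    "type": "choices",
                    "value": {"choices": [label]},
                }],
                "score": score,
            })
        return predictions

    def fit(self, *args, **kwargs):
        # Retrain on newly submitted annotations; omitted in this sketch.
        pass

    def classify(self, url):
        # Hypothetical stand-in for the actual ML pipeline's classifier.
        return "Incident Reports", 0.5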

maxachis commented 7 months ago

Questions for Label Studio Team

I'll update this comment with additional questions as I progress:

maxachis commented 7 months ago

Creating/Rotating Users

Can be done using the API. I've linked to the relevant commands.

I updated my PR to include functionality for updating a member's role, as well as an integration test demonstrating this.

josh-chamberlain commented 7 months ago

We got some responses:

It was nice meeting you today. To follow up on our conversation today, here are the answers to your questions:

  1. Your assumptions were correct here. (the dummy model is out of date)
  2. You can achieve this by inserting the pre-annotations in the "annotations" key in the json file format with the user information, if needed. Here is the doc for example. And you can leave the "prediction" key empty.
  3. Correct, pre-annotations are only supported in JSON format, at the moment, we do not support CSVs for pre-annotations
  4. I have attached in this email a json file sample where you can load pre-annotations as submitted annotations. Keep in mind, if you omit the annotator info, the software will automatically show the user that created the tasks as the annotator. Here is a sample python code for the API call to create the tasks from the json:
import requests
import json

def load_json(input_file):
    with open(input_file) as f:
        json_file = json.load(f)
    return json_file

api_key = "your_api_key"  # personal access token for the Label Studio account

project_id = 00000  # replace with the target project's ID

url = f"https://app.heartex.com/api/projects/{project_id}/import"

# or just load the json directly from the script as a dict
data = load_json("json_imports/fish_anno.json")

headers = {
    "Authorization": f"Token {api_key}",
    "Content-Type" : "application/json"
}

response = requests.post(
    url=url,
    headers=headers,
    json=data
)

print(f"Response code: {response.status_code}, Response body: {response.text}")

fish_anno.json

maxachis commented 7 months ago

I'm working on creating code that can convert our data into the requisite format for pre-annotations.

Bear in mind, the data must be in a very precise format, which is not always optimally documented.

I may also need @EvilDrPurple's insight as to how the label data output by the ML pipeline is currently represented, as I will need to know how to convert data from that format into Label Studio's bespoke format.
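As a rough sketch (assuming the ML pipeline gives us simple (url, predicted_label) pairs, and that "label" and "url" match the control and tag names in our labeling config), the conversion could look something like this:

import json

def to_preannotated_tasks(rows):
    # Convert (url, predicted_label) pairs into Label Studio tasks, storing each
    # prediction under the "annotations" key so it lands directly in the review queue.
    tasks = []
    for url, predicted_label in rows:
        tasks.append({
            "data": {"url": url},
            "annotations": [{
                "result": [{
                    "from_name": "label",  # Choices control name in the labeling config
                    "to_name": "url",      # Text tag name in the labeling config
                    "type": "choices",
                    "value": {"choices": [predicted_label]},
                }],
            }],
        })
    return tasks

rows = [("https://example-county.gov/police/records", "Incident Reports")]
with open("preannotated_tasks.json", "w") as f:
    json.dump(to_preannotated_tasks(rows), f, indent=2)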

maxachis commented 7 months ago

Can confirm I've successfully been able to import pre-annotated data into Label Studio, which can then be reviewed directly, bypassing the annotation stage. In other words, we can create a full pipeline with either unannotated or pre-annotated data.

My next priority will be to create an example pipeline, using fake data, that people can run to illustrate how the workflow would work.

I'll be putting aside a demonstration of the programmatic user rotation functionality (which we'd need to decide whether we want to pursue), as well as the active machine learning loop, neither of which is part of the Minimum Viable Product for this issue.

maxachis commented 7 months ago

I have created and linked #47, a draft PR that at the moment mainly exists to demonstrate the functionality of Label Studio and how it would look to utilize it in a (simplified) pipeline.

@josh-chamberlain @mbodeantor I invite y'all to check it out and see

  1. If you can run basic_demonstration.py successfully
  2. If the workflow makes sense and is what you'd expect.

maxachis commented 7 months ago

@josh-chamberlain Since it's been over a week, I wanted to additionally ping you on this, in case it got lost in the shuffle.

josh-chamberlain commented 7 months ago

@maxachis sorry about the delay, I wasn't getting notifications. I'm looking at this now.

josh-chamberlain commented 6 months ago

@maxachis I made a project called Labeling interface

Aside from the fact that the actual text we're labeling will look different, this is what I was expecting the process to look like. What do you think about making this canonical, and calling this issue closed? If the project is set up, we just need to hit it with tasks.

It's easy enough to make these 3 separate labeling tasks—but I think it's better if each URL only goes through the pipeline once, because it takes time for someone to read and understand what they're looking at.

screencast 2024-04-03 14-29-03

maxachis commented 6 months ago

@maxachis I made a project called Labeling interface

Aside from the fact that the actual text we're labeling will look different, this is what I was expecting the process to look like. What do you think about making this canonical, and calling this issue closed? If the project is set up, we just need to hit it with tasks.

It's easy enough to make these 3 separate labeling tasks—but I think it's better if each URL only goes through the pipeline once, because it takes time for someone to read and understand what they're looking at.

@josh-chamberlain This interface is wayyy better looking and comprehensible than what I came up with, so no complaints there.

I also have no issue with having this be one task, and in fact I think it's probably considerably easier for the user that way as well.

I'm also happy to close this issue. I think after this we'd just need to create one or two issues for the process of ETL'ing data into and out of this.