The FreeTxt app

The FreeTxt app is available at www.freetxt.app and is currently under development.

Building with Docker/Podman (as a User)

This is the recommended way to set up FreeTxt2 for normal usage. If you need to develop FreeTxt, see the developer sections below.

Step 1: Install Docker Desktop or Podman Desktop

Pick either Docker Desktop or Podman Desktop (either will work!) and follow the setup instructions for the product you chose.

Step 2: Run the ghcr.io/ucrel/freetxt-flask:main image

Open a console or terminal, and run:

docker run -p 8000:8000 --rm ghcr.io/ucrel/freetxt-flask:main

for Docker Desktop or:

podman run -p 8000:8000 --rm ghcr.io/ucrel/freetxt-flask:main

for Podman Desktop.

After the image has downloaded and the container started up, you should be able to visit http://localhost:8000/ to access FreeTxt2.

The cache folder and working directory are available on the /cache and /var/freetxt paths respectively, and can be mounted to your host if required through bind mounts or volume mounts.

If you plan to run FreeTxt2 for very long periods of time, you may want to mount /cache: it holds any text uploaded for processing, so it can become quite large over time. Bind mounting it lets you periodically clear the folder without restarting the container.
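If /cache is bind-mounted to a host folder, the periodic clean-up could be scripted roughly as follows. This is a minimal sketch and not part of FreeTxt: the host folder ./freetxt-cache, the function name clear_old_files, and the seven-day threshold are all assumptions chosen for illustration.

```python
import time
from pathlib import Path


def clear_old_files(cache_dir, max_age_days=7.0, now=None):
    """Delete regular files under cache_dir that are older than max_age_days.

    Returns the list of deleted paths. Directories are left in place.
    """
    root = Path(cache_dir)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    # sorted() materialises the listing before anything is removed.
    for path in sorted(root.rglob("*")):
        # Only remove regular files last modified before the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted


if __name__ == "__main__":
    # Hypothetical host folder that was bind-mounted to /cache in the container.
    host_cache = Path("./freetxt-cache")
    if host_cache.is_dir():
        removed = clear_old_files(host_cache, max_age_days=7)
        print(f"Removed {len(removed)} file(s) from {host_cache}")
```

The script only touches files on the host side of the mount, so the container keeps running while it works. Adjust the threshold so that files still being processed are not removed.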

Building a local Docker image (as a Developer)

To build a local image (for development), clone this repository, then run the following from the root directory of the clone:

docker build -t freetxt-local .

or:

podman build -t freetxt-local .

(Note that the . at the end is important: it tells the build to use the current directory as its context.)

FreeTxt-Flask Setup Instructions (as a Developer)

Follow these steps to set up the project on your local machine without a container runtime.

Prerequisites

Before you start, make sure you have installed Git and Python 3 (including pip and the venv module), which the steps below rely on.

Step 1: Clone the Repository

Clone the FreeTxt-Flask repository from GitHub to your local machine:

git clone https://github.com/UCREL/FreeTxt-Flask.git
cd FreeTxt-Flask

Step 2: Set Up a Python Virtual Environment

It's recommended to use a Python virtual environment for project dependencies to avoid conflicts with other projects or system-wide Python packages.

Create a Virtual Environment

For Unix/Linux/macOS:

python3 -m venv venv

For Windows:

python -m venv venv

Activate the Virtual Environment

On Unix/Linux/macOS:

source venv/bin/activate

On Windows (Command Prompt):

.\venv\Scripts\activate

On Windows (PowerShell):

.\venv\Scripts\Activate.ps1

Step 3: Install Project Dependencies

With your virtual environment activated, install the project dependencies using:

pip install -r requirements.txt

Step 4: Configure the Flask Application

Inform Flask about the entry point of your application by setting the FLASK_APP environment variable:

Unix/Linux/macOS:

export FLASK_APP=main.py

Windows (Command Prompt):

set FLASK_APP=main.py

Windows (PowerShell):

$env:FLASK_APP = "main.py"

Step 5: Run the Flask Application

Now you're ready to run the application:

flask run

This will start a local web server. By default, the Flask application will be accessible at http://127.0.0.1:5000/ from your web browser.


SentimentAnalyser Class

Overview

SentimentAnalyser is a Python class that uses a pre-trained BERT model for sentiment analysis on textual data. It can process text in multiple languages, with specific support for English and Welsh.

Initialisation

Initializes the sentiment analysis model and tokenizer upon instantiation.

Methods

preprocess_text(text)

analyse_sentiment(input_text, language, num_classes, max_seq_len=512)

generate_scattertext_visualization(dfanalysis, language)

This class is suitable for projects requiring detailed sentiment analysis and visualization, especially in bilingual contexts.

KWICAnalyser Class

Overview

KWICAnalyser is a Python class tailored for Keyword-in-Context (KWIC) analysis, semantic tagging, and collocation analysis in textual data. It offers comprehensive functionality for textual analysis, especially useful in linguistic and semantic studies.

Initialisation

Initializes the class with either a text string or a pandas DataFrame. It preprocesses the text, performs semantic tagging, and prepares it for further analysis.

Methods

get_kwic(keyword, window_size, max_instances, lower_case, by_tag, by_sem)

get_top_n_words(remove_stops, topn)

get_collocs(kwic_insts, topn)

plot_coll_14(keyword, collocs, output_file)

get_collocation_strength(keyword, topn, window_size, by_tag, by_sem)

tag_semantics(text)

get_sorted_unique_tags()

get_word_frequencies()

This class is essential for in-depth text analysis, providing tools for KWIC analysis, collocation strength measurement, and semantic tagging, making it particularly valuable in linguistic research and text processing applications.
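To make the idea of Keyword-in-Context output concrete, here is a minimal, self-contained sketch. It is not the KWICAnalyser implementation: the function names simple_kwic and simple_collocs are hypothetical, the tokenisation is a plain regular expression, and collocates are ranked by raw counts rather than by any association measure. The parameter names window_size, max_instances, lower_case, and topn are borrowed from the method list above only for readability.

```python
import re
from collections import Counter


def simple_kwic(text, keyword, window_size=3, max_instances=10, lower_case=True):
    """Return (left context, keyword, right context) tuples for each match."""
    tokens = re.findall(r"\w+", text)
    target = keyword.lower() if lower_case else keyword
    results = []
    for i, token in enumerate(tokens):
        candidate = token.lower() if lower_case else token
        if candidate != target:
            continue
        # Take up to window_size tokens on each side of the match.
        left = " ".join(tokens[max(0, i - window_size):i])
        right = " ".join(tokens[i + 1:i + 1 + window_size])
        results.append((left, token, right))
        if len(results) >= max_instances:
            break
    return results


def simple_collocs(kwic_insts, topn=5):
    """Count the words that appear in the context windows of the matches."""
    counts = Counter()
    for left, _, right in kwic_insts:
        counts.update(word.lower() for word in (left + " " + right).split())
    return counts.most_common(topn)


if __name__ == "__main__":
    sample = "The castle was busy. We loved the castle gardens and the castle cafe."
    rows = simple_kwic(sample, "castle", window_size=2)
    for left, kw, right in rows:
        print(f"{left:>12} | {kw} | {right}")
    print(simple_collocs(rows, topn=2))
```

Each row pairs the keyword with the words immediately around it, which is the basic shape a KWIC display takes. A real analysis would typically also filter stop words and handle apostrophes and punctuation more carefully than this regular expression does.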

WordCloudGenerator Class

Overview

WordCloudGenerator is a Python class for creating visually appealing word clouds with custom shapes, colors, and filtering options. It is capable of semantic tagging, handling different languages, and applying various statistical measures for word selection.

Initialisation

Methods

load_image(image_file)

preprocess_data(data, language)

Pymsas_tags(text)

calculate_measures(df, measure, language)

filter_words(word_list)

get_wordcloud(dataframe, metric, word_list, cloud_shape_path, cloud_outline_color, cloud_type)

compute_word_frequency(tokenized_words, language)

generate_wordcloud_type(input_data, cloud_type, language, cloud_measure, wordlist=None)

generate_wordcloud(cloud_shape_path, cloud_outline_color, cloud_type, language, cloud_measure, wordlist={})

This class is particularly useful for linguistic and textual data visualization, offering versatile word cloud generation capabilities for a wide range of applications.

Summariser class

run_summarizer Function Overview

The run_summarizer function produces a concise summary of a given text using the TextRank algorithm. It accepts either a string or an iterable object; input that is not a string is converted into one first. The chosen_ratio parameter controls the length of the summary by setting the proportion of the original text to include.
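The following sketch illustrates how a TextRank-style summariser with a ratio parameter can work. It is an illustration under stated assumptions, not the code behind run_summarizer: the function names (split_sentences, similarity, textrank_summary) are hypothetical, sentences are split with a rough regular expression, the ratio is applied to the number of sentences and rounded up, and sentence similarity uses simple word overlap normalised by log sentence sizes.

```python
import math
import re


def split_sentences(text):
    """Very rough sentence splitter: break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(a, b):
    """Shared-word count normalised by the log of each sentence's word-set size."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_summary(text, chosen_ratio=0.3, damping=0.85, iterations=50):
    """Return the top-ranked sentences, in their original order."""
    if not isinstance(text, str):
        # Assumption: non-string input is an iterable of text fragments.
        text = " ".join(str(item) for item in text)
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    # Weighted graph: edge weight = similarity between two distinct sentences.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update over the sentence graph.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    keep = max(1, math.ceil(chosen_ratio * n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


if __name__ == "__main__":
    reviews = ("Cardiff Castle is a popular attraction. "
               "Visitors praised the castle gardens. "
               "The cafe was expensive. "
               "Many visitors said the castle tour was excellent.")
    print(textrank_summary(reviews, chosen_ratio=0.5))
```

A larger ratio keeps more sentences, and a ratio of 1.0 returns every sentence. Selected sentences are emitted in their original order so the summary still reads naturally.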

Key Features

LanguageChecker Class

Overview

LanguageChecker is a Python class designed for efficient language detection and segregation within pandas dataframes. It is particularly useful for processing multilingual text datasets and can specifically handle English and Welsh texts.

Initialisation

Methods

detect_language_file(text)

detect_and_split_languages()

This class is an essential tool for projects dealing with bilingual or multilingual datasets, offering streamlined processing for language-specific analysis or operations.

Contacts

Creative Commons Licence

Citation

If you use FreeTxt in your research or project, please cite it as follows:

Khallaf, N., Ezeani, I., Knight, D., Rayson, P., El-Haj, M. and Morris, S. (2023). FreeTxt – A Bilingual Free-Text Analysis and Visualisation Toolkit [Software]. Cardiff University and Lancaster University. Available at: www.freetxt.app