Here is the link to the FreeTxt app, which is currently under development.
FreeTxt was developed as part of "FreeTxt: supporting bilingual free-text survey and questionnaire data analysis", an AHRC-funded collaborative research project involving colleagues from Cardiff University and Lancaster University (Grant Number AH/W004844/1).
The team included PI Dawn Knight; CIs Paul Rayson and Mo El-Haj; and RAs Nouran Khallaf, Ignatius Ezeani and Steve Morris. The Project Advisory Group included representatives from: National Trust Wales, Cadw, Museum Wales, CBAC | WJEC and the National Centre for Learning Welsh.
For further information on the FreeTxt project please contact the project team: CorCenCC@Cardiff.ac.uk
This is the recommended way to run FreeTxt2 for normal usage. If you need to develop FreeTxt itself, please see the sections that follow this one.
Pick one of the following container runtimes (either will work!) and follow its setup instructions: Docker Desktop or Podman Desktop.

The pre-built container image is available at:

ghcr.io/ucrel/freetxt-flask:main

Open a console or terminal, and run:
docker run -p 8000:8000 --rm ghcr.io/ucrel/freetxt-flask:main
for Docker Desktop or:
podman run -p 8000:8000 --rm ghcr.io/ucrel/freetxt-flask:main
for Podman Desktop.
After the image has downloaded and the container started up, you should be able to visit http://localhost:8000/ to access FreeTxt2.
The cache folder and working directory are available at the /cache and /var/freetxt paths respectively, and can be mounted to your host if required through bind mounts or volume mounts.
If you plan to run FreeTxt2 for very long periods of time, you may want to mount /cache, as it stores any uploaded text for processing and can become quite large over time. Bind mounting it lets you periodically clear the folder without restarting the container.
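For example, the cache can be bind-mounted to a host folder like this (the host path ./freetxt-cache is only an example, and the run is guarded so the snippet is a no-op on machines without Docker):

```shell
# Create a host folder and bind-mount it over the container's /cache,
# so uploaded files can be cleared without restarting the container.
mkdir -p ./freetxt-cache
if command -v docker >/dev/null 2>&1; then
  docker run -p 8000:8000 --rm \
    -v "$(pwd)/freetxt-cache:/cache" \
    ghcr.io/ucrel/freetxt-flask:main
fi
```

The same -v flag works with podman run.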
To build a local image (for development) simply clone this repository, then from the root directory of the clone run:
docker build -t freetxt-local .
or:
podman build -t freetxt-local .
(Note: the . at the end is important!)
Follow these steps to set up the project on your local machine without a container runtime.
Before you start, make sure you have installed:
Clone the FreeTxt-Flask repository from GitHub to your local machine:
git clone https://github.com/UCREL/FreeTxt-Flask.git
cd FreeTxt-Flask
It's recommended to use a Python virtual environment for project dependencies to avoid conflicts with other projects or system-wide Python packages.
For Unix/Linux/macOS:
python3 -m venv venv
For Windows:
python -m venv venv
On Unix/Linux/macOS:
source venv/bin/activate
On Windows (Command Prompt):
.\venv\Scripts\activate
On Windows (PowerShell):
.\venv\Scripts\Activate.ps1
With your virtual environment activated, install the project dependencies using:
pip install -r requirements.txt
Inform Flask about the entry point of your application by setting the FLASK_APP environment variable.

On Unix/Linux/macOS:

export FLASK_APP=main.py

On Windows (Command Prompt):

set FLASK_APP=main.py

On Windows (PowerShell):

$env:FLASK_APP = "main.py"
Now you're ready to run the application:
flask run
This will start a local web server. By default, the Flask application will be accessible at http://127.0.0.1:5000/ from your web browser.
SentimentAnalyser
is a Python class that leverages a pre-trained BERT model for sentiment analysis on textual data. It can process texts in different languages and specifically supports both English and Welsh.
Initializes the sentiment analysis model and tokenizer upon instantiation.
preprocess_text(text)
- text (str): The text string to preprocess.

analyse_sentiment(input_text, language, num_classes, max_seq_len=512)
- input_text (str): The text to analyze.
- language (str): The language of the text ('en' for English, 'cy' for Welsh).
- num_classes (int): Number of sentiment classes (3 or 5).
- max_seq_len (int, optional): Maximum sequence length for tokenization.

generate_scattertext_visualization(dfanalysis, language)
- dfanalysis (pd.DataFrame): DataFrame containing the sentiment analysis results.
- language (str): The language for the visualization.

This class is suitable for projects requiring detailed sentiment analysis and visualization, especially in bilingual contexts.
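The mapping from raw model outputs to a 3- or 5-class sentiment label can be sketched in a few lines of plain Python. This is a toy illustration of the num_classes idea; the label names and the softmax step are assumptions, not FreeTxt's actual code:

```python
import math

def scores_to_sentiment(logits, num_classes=3):
    """Map a list of raw class scores to a sentiment label and probability.

    The label sets below are illustrative; FreeTxt's own mapping may differ.
    """
    labels3 = ["negative", "neutral", "positive"]
    labels5 = ["very negative", "negative", "neutral",
               "positive", "very positive"]
    labels = labels3 if num_classes == 3 else labels5
    # Softmax over the raw scores to get a probability distribution.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the argmax of the distribution.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

For example, scores_to_sentiment([0.1, 0.2, 2.0], num_classes=3) returns the "positive" label with its softmax probability.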
KWICAnalyser
is a Python class tailored for Keyword-in-Context (KWIC) analysis, semantic tagging, and collocation analysis in textual data. It offers comprehensive functionality for textual analysis, especially useful in linguistic and semantic studies.
Initializes the class with either a text string or a pandas DataFrame. It preprocesses the text, performs semantic tagging, and prepares it for further analysis.
get_kwic(keyword, window_size, max_instances, lower_case, by_tag, by_sem)
- keyword (str): The target keyword.

get_top_n_words(remove_stops, topn)
get_collocs(kwic_insts, topn)
plot_coll_14(keyword, collocs, output_file)
get_collocation_strength(keyword, topn, window_size, by_tag, by_sem)
tag_semantics(text)
get_sorted_unique_tags()
get_word_frequencies()
This class is essential for in-depth text analysis, providing tools for KWIC analysis, collocation strength measurement, and semantic tagging, making it particularly valuable in linguistic research and text processing applications.
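Stripped to its essentials, KWIC extraction slides a window over the tokens around each match. The sketch below illustrates the technique only; it is not FreeTxt's implementation, and the by_tag/by_sem options are omitted:

```python
def kwic(text, keyword, window_size=3, max_instances=10, lower_case=False):
    """Return (left-context, keyword, right-context) tuples for each match."""
    tokens = text.split()
    if lower_case:
        tokens = [t.lower() for t in tokens]
        keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            # Take up to window_size tokens on either side of the match.
            left = " ".join(tokens[max(0, i - window_size):i])
            right = " ".join(tokens[i + 1:i + 1 + window_size])
            hits.append((left, tok, right))
            if len(hits) >= max_instances:
                break
    return hits
```

For instance, kwic("the cat sat on the mat", "the", window_size=2) yields two concordance lines, one per occurrence of "the".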
WordCloudGenerator
is a Python class for creating visually appealing word clouds with custom shapes, colors, and filtering options. It is capable of semantic tagging, handling different languages, and applying various statistical measures for word selection.
Initializes the class with input_data, which can be a DataFrame or raw text.

load_image(image_file)
preprocess_data(data, language)
Pymsas_tags(text)
calculate_measures(df, measure, language)
filter_words(word_list)
get_wordcloud(dataframe, metric, word_list, cloud_shape_path, cloud_outline_color, cloud_type)
compute_word_frequency(tokenized_words, language)
generate_wordcloud_type(input_data, cloud_type, language, cloud_measure, wordlist=None)
generate_wordcloud(cloud_shape_path, cloud_outline_color, cloud_type, language, cloud_measure, wordlist={})
This class is particularly useful for linguistic and textual data visualization, offering versatile word cloud generation capabilities for a wide range of applications.
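The word-frequency step behind a word cloud can be sketched with the standard library. This is a minimal illustration of what a method like compute_word_frequency might do; the stopword list is an assumption, not FreeTxt's actual one:

```python
from collections import Counter

# A tiny illustrative stopword set; a real generator would use a
# language-specific list (e.g. for English or Welsh).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def compute_word_frequency(tokenized_words, stopwords=STOPWORDS):
    """Count how often each word occurs, ignoring case, stopwords,
    and non-alphabetic tokens."""
    return Counter(
        w.lower() for w in tokenized_words
        if w.lower() not in stopwords and w.isalpha()
    )
```

These counts would then be fed to the cloud renderer, with the chosen statistical measure weighting word sizes.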
The run_summarizer function provides a concise summary of a given text using the TextRank algorithm. It is versatile: it accepts both strings and iterable objects as input, converting non-string input into a string. The core of the function lies in its ability to adjust the length of the summary based on the chosen_ratio parameter, which dictates the proportion of the original text to be included in the summary.
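The ratio mechanics can be illustrated with a toy extractive summarizer. This sketch scores sentences by simple word frequency rather than TextRank, so it is an illustration of the chosen_ratio behaviour only, not FreeTxt's algorithm:

```python
import math
import re

def summarize_by_ratio(text, chosen_ratio=0.2):
    """Keep the top `chosen_ratio` proportion of sentences (minimum 0.1),
    scored by average word frequency, returned in original order."""
    ratio = max(chosen_ratio, 0.1)  # enforce the minimum threshold
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    def score(s):
        toks = re.findall(r"\w+", s.lower())
        return sum(freq.get(t, 0) for t in toks) / (len(toks) or 1)
    # The ratio controls how many sentences survive into the summary.
    keep = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(sorted(sentences, key=score, reverse=True)[:keep],
                 key=sentences.index)
    return " ".join(top)
```

With a four-sentence input and chosen_ratio=0.1, the minimum threshold still yields a one-sentence summary.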
The length of the summary is controlled by the chosen_ratio parameter, with a minimum threshold set at 0.1.

LanguageChecker
is a Python class designed for efficient language detection and segregation within pandas dataframes. It is particularly useful for processing multilingual text datasets and can specifically handle English and Welsh texts.
- data: The pandas dataframe to be processed.
- column: The name of the column in the dataframe containing the text.

detect_language_file(text)
- text (str): The text string for language detection. Returns None if the detection fails.

detect_and_split_languages()
Returns two dataframes: one containing all English texts ('en') and the other all Welsh texts ('cy'), if present.

This class is an essential tool for projects dealing with bilingual or multilingual datasets, offering streamlined processing for language-specific analysis or operations.
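The detect-and-split behaviour can be illustrated with a small self-contained sketch. The marker-word heuristic below is purely illustrative; FreeTxt uses a real language detector:

```python
import pandas as pd

def naive_detect(text):
    # Stand-in for a real language detector: flag text as Welsh ('cy')
    # if it contains common Welsh function words, otherwise English ('en').
    welsh_markers = {"mae", "yn", "wedi", "ac", "i'r"}
    return "cy" if set(text.lower().split()) & welsh_markers else "en"

def detect_and_split(df, column):
    # Mirrors the idea of detect_and_split_languages():
    # one dataframe per detected language.
    langs = df[column].apply(naive_detect)
    return df[langs == "en"].copy(), df[langs == "cy"].copy()

df = pd.DataFrame({"response": [
    "The service was excellent",
    "Mae'r gwasanaeth yn wych",
]})
df_en, df_cy = detect_and_split(df, "response")
```

Each returned dataframe can then be routed to language-specific analysis (e.g. the English or Welsh sentiment model).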
If you use FreeTxt in your research or project, please cite it as follows:
Khallaf, N., Ezeani, I., Knight, D., Rayson, P., El-Haj, M. and Morris, S. (2023). FreeTxt – A Bilingual Free-Text Analysis and Visualisation Toolkit [Software]. Cardiff University and Lancaster University. Available at: www.freetxt.app