Vignana-Jyothi / kp-gen-ai

MIT License

[Theory] Google Colab #12

Open head-iie-vnr opened 2 days ago

head-iie-vnr commented 2 days ago

Google Colab (short for "Colaboratory") is a free cloud-based platform provided by Google that allows users to write and execute Python code in a Jupyter notebook environment. It is particularly popular among data scientists, machine learning practitioners, and educators for several reasons:

  1. Cloud-Based: Since it is hosted on the cloud, you do not need to install any software on your local machine. You can access your notebooks from any device with an internet connection.

  2. Free Access to GPUs and TPUs: Google Colab offers free access to powerful hardware accelerators like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), which significantly speed up the execution of complex computations, especially those involved in deep learning and large-scale data processing.

  3. Jupyter Notebook Interface: The interface is similar to Jupyter Notebooks, which means you can write and run Python code in cells, visualize data with plots, and integrate Markdown for documentation.

  4. Integration with Google Drive: You can easily save and manage your notebooks in your Google Drive, enabling seamless collaboration and sharing.

  5. Pre-Installed Libraries: Google Colab comes pre-installed with many popular Python libraries for data science and machine learning, such as TensorFlow, Keras, PyTorch, Pandas, NumPy, and many more.

  6. Collaboration: Multiple users can work on the same notebook simultaneously, making it an excellent tool for collaborative projects and teaching.

  7. Easy to Share: You can share your notebooks with others via a simple link, and they can view or even edit the notebook depending on the permissions you set.

Google Colab is widely used for tasks such as prototyping machine learning models, conducting exploratory data analysis, and teaching programming and data science concepts.
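Since the attached hardware can change between sessions, it is often useful to confirm from inside a notebook whether a GPU runtime is active. A minimal sketch using only the standard library (assuming, as in Colab GPU runtimes, that the NVIDIA driver tools are on the PATH):

```python
import shutil

def gpu_runtime_attached() -> bool:
    """Return True if the nvidia-smi tool is on PATH, which in a
    Colab notebook indicates a GPU runtime is attached."""
    return shutil.which("nvidia-smi") is not None

print("GPU runtime:", gpu_runtime_attached())
```

In a Colab CPU runtime this prints `GPU runtime: False`; switching the runtime type to GPU (Runtime > Change runtime type) makes it print `True`.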

head-iie-vnr commented 2 days ago

Google Colab has certain quota limits to ensure fair usage and resource availability for all users. These limits can vary, especially for free users, and they may change over time. Here are some of the key quota limits:

Free Tier:

  1. Runtime Duration:

    • The maximum continuous usage of a single session is typically 12 hours. After this, the session will disconnect automatically.
    • Idle sessions (sessions where there is no code execution or user interaction) can be disconnected after 30 minutes of inactivity.
  2. Daily Usage Limits:

    • The free tier has usage limits on the total amount of time you can use GPUs and TPUs. Exact limits are not publicly specified but can vary based on resource availability and usage patterns.
  3. Hardware Specifications:

    • The type of GPU provided can vary (e.g., NVIDIA K80s, T4s, P4s, and P100s). The specific GPU available to you can change dynamically based on demand.
    • The free tier generally provides 12 GB of RAM, but this can sometimes increase to 25 GB if the notebook requires more memory.
  4. Disk Space:

    • Free users typically get around 100 GB of disk space, but this is temporary storage that is reset when the session ends or is recycled.
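Because the scratch disk is ephemeral and its size can vary, it is worth checking how much space is actually free before downloading large datasets. A small sketch using the standard library:

```python
import shutil

def free_disk_gb(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in gigabytes."""
    usage = shutil.disk_usage(path)
    return usage.free / 1e9

print(f"Free disk space: {free_disk_gb():.1f} GB")
```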

Colab Pro and Colab Pro+:

Google offers paid plans called Colab Pro and Colab Pro+ that provide additional resources and higher limits:

  1. Colab Pro:

    • Priority access to faster GPUs such as the NVIDIA T4 and P100.
    • Longer maximum runtime (up to 24 hours).
    • Increased usage limits for GPUs.
    • More RAM (up to 32 GB).
  2. Colab Pro+:

    • Access to the best available GPUs, including NVIDIA V100s.
    • Even longer runtimes and more generous usage limits.
    • Even more RAM (up to 52 GB).

General Considerations:

It's important to be aware that these limits are subject to change, and it's a good idea to check the Google Colab FAQ or the Google Colab Pro page for the latest information.

head-iie-vnr commented 2 days ago

Working with Large Language Models (LLMs) in Python involves using various libraries that facilitate model implementation, data handling, and evaluation. Here are some of the most popular libraries:

1. Transformers

2. TensorFlow

3. PyTorch

4. Datasets

5. Numpy

6. Pandas

7. Matplotlib and Seaborn

8. scikit-learn

9. NLTK and spaCy

These libraries provide a comprehensive toolkit for working with large language models and performing a wide range of machine learning and data processing tasks.
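Not every environment ships all of these libraries, so a quick availability check can save debugging time. A sketch using `importlib.util.find_spec` from the standard library (note the import names, e.g. `sklearn` for scikit-learn, differ from the pip package names):

```python
from importlib.util import find_spec

# Package names as they are imported, not as installed via pip
libraries = ["transformers", "torch", "datasets", "numpy",
             "pandas", "matplotlib", "sklearn", "nltk", "spacy"]

for name in libraries:
    status = "available" if find_spec(name) is not None else "missing"
    print(f"{name:12s} {status}")
```

Any library reported as missing can then be installed in a notebook cell with `!pip install <package>`.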

head-iie-vnr commented 2 days ago

https://colab.research.google.com/drive/1wrdxxv1aczuFdjRpHREKxo8dx1LPfM3a#scrollTo=OoGw2U0696Ip

```shell
!pip install transformers torch
```

```python
# Import necessary libraries
from transformers import pipeline

# Initialize the sentiment analysis pipeline with a specific model
senti_analysis = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Define the list of texts
texts = [
    "I do not like to eat cake, but I like the smell",
    "I like cake smell, but I do not like to eat it",
    "I like the cake",
    "I do not like the smell"
]

# Perform sentiment analysis on the texts
results = senti_analysis(texts)

# Print the results
for text, result in zip(texts, results):
    print(f"Text: {text}\nSentiment: {result['label']}, Score: {result['score']:.4f}\n")
```

By specifying both the task and the model, you ensure that the pipeline is configured correctly to provide reliable sentiment analysis results for your input texts.

head-iie-vnr commented 2 days ago

The `pipeline` function used to initialize sentiment analysis

Parameters:

  1. "sentiment-analysis"
  2. model="distilbert-base-uncased-finetuned-sst-2-english"

Detailed Explanation:

  1. "sentiment-analysis"

    • Description: This is the name of the task you want to perform. In this case, it specifies that you want to use the pipeline for sentiment analysis.
    • Function: The pipeline function supports various tasks such as "text-generation", "text-classification", "question-answering", etc. By specifying "sentiment-analysis", you are telling the pipeline to use a model that can classify the sentiment of a given text as positive or negative (some models also support a neutral class).
  2. model="distilbert-base-uncased-finetuned-sst-2-english"

    • Description: This parameter specifies the pre-trained model to use for the sentiment analysis task.
    • Function:
      • Model Name: "distilbert-base-uncased-finetuned-sst-2-english" is a specific model hosted on Hugging Face's Model Hub. It is a fine-tuned version of the DistilBERT model, which is a smaller and faster variant of BERT (Bidirectional Encoder Representations from Transformers).
      • Pre-trained Model: The model has been pre-trained on a large dataset and fine-tuned specifically for the SST-2 (Stanford Sentiment Treebank) dataset, which makes it well-suited for sentiment analysis tasks in English.
      • Usage: By specifying this model, you ensure that the pipeline uses the appropriate pre-trained weights and architecture for the sentiment analysis task, providing more accurate results than using a generic or unspecified model.

Example:

The specified model and task work together to analyze the sentiment of input texts. Here is how the pipeline utilizes these parameters:
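A minimal sketch of how these two arguments fit together, using the same task and model named above:

```python
from transformers import pipeline

# The task name selects the pipeline type; the model name selects the
# pre-trained weights. Omitting `model` makes transformers fall back to
# a default checkpoint for the task (and print a warning).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Each prediction is a dict with a label and a confidence score
result = classifier("I like the cake")[0]
print(result["label"], round(result["score"], 4))
```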

head-iie-vnr commented 2 days ago

Output generated:

Text: I do not like to eat cake, but I like the smell
Sentiment: POSITIVE, Score: 0.9992

Text: I like cake smell, but I do not like to eat it
Sentiment: POSITIVE, Score: 0.5582

Text: I like the cake
Sentiment: POSITIVE, Score: 0.9997

Text: I do not like the smell
Sentiment: NEGATIVE, Score: 0.9969

head-iie-vnr commented 2 days ago

Understanding the model's details

https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

The above page shows more details about the model, such as its architecture, training data, and evaluation results.

Discover models by task type

https://huggingface.co/models?pipeline_tag=text-generation&sort=trending

The sentiment-analysis task falls under the 'Text Classification' task type.