Sarah111-AHM / Semsmah


Pandas book summary #42

Open Sarah111-AHM opened 1 year ago

Sarah111-AHM commented 1 year ago

Import the pandas and NumPy libraries

import numpy as np
import pandas as pd

This code imports the numpy and pandas libraries in Python.

numpy is a library for scientific computing that supports mathematical and scientific operations on numbers and array data.

pandas is a library for data analysis that supports working with array-like data and tables.

The "as" keyword binds each package to a short alias (np and pd), which makes the code easier to write and read.

Once these libraries are imported, the many functions and tools they provide can be used for working with data and scientific computation.

Data structures in pandas

1. Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is s = pd.Series(data, index=index), where data can be many different things: a dict, an ndarray, or a scalar value (like 5).

# The basic method to create a Series:
# s = pd.Series(data, index=index)
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s)

This code uses pandas to create a Series, one of the basic structures for storing data in pandas.

A new Series is created with pd.Series() by passing the data and the index.

The data is the list [1, 2, 3, 4] and the index is the list ["a", "b", "c", "d"], so the Series holds 4 elements, each with its own index label.

The Series is then displayed with print().

The Series appears as follows:

a    1
b    2
c    3
d    4
dtype: int64

The index labels appear on the left side of the output and the data on the right. The data type is shown as dtype: int64.

NOTE:

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

NDARRAY

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
s.index

OR

s = pd.Series(np.random.randn(5),['a', 'b', 'c', 'd', 'e'])
print(s)
# If no index is specified, pandas assigns an incremental integer index automatically
s = pd.Series(np.random.randn(5))
print (s)
s.index
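The two rules in the note above (a default index when none is passed, and a length check when one is) can be seen in a small sketch. The names values, default_indexed, and length_error_raised are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# No index passed: pandas creates the default index 0 .. len(data) - 1.
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# An index whose length differs from the data is rejected with a ValueError.
try:
    pd.Series(values, index=["a", "b"])
    length_error_raised = False
except ValueError as exc:
    length_error_raised = True
    print("ValueError:", exc)
```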

NOTE (not essential):

pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).

DICT

s = {'b': 1, 'a': 0, 'c': 2}
pd.Series(s)

NOTE:

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order, if you’re using Python version >= 3.6 and Pandas version >= 0.23. If you’re using Python < 3.6 or Pandas < 0.23, and an index is not passed, the Series index will be the lexically ordered list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a Pandas version lower than 0.23, the Series would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).

s = {'a': 0., 'b': 1., 'c': 2.}
pd.Series(s)
pd.Series(s, index=['b', 'c', 'd', 'a'])

NOTE :

NaN (not a number) is the standard missing data marker used in pandas.
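In the reindexed example above, the label 'd' has no entry in the dict, so pandas fills it with NaN. A minimal sketch of detecting and dropping that missing value, using Series.isna() and Series.dropna() (the variable names are illustrative):

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}

# 'd' is not a key of the dict, so its value becomes NaN.
s = pd.Series(d, index=["b", "c", "d", "a"])

# isna() marks missing entries with True.
missing = s.isna()
print(missing)

# dropna() returns a new Series without the missing entries.
print(s.dropna())
```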

scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)

Operations: Series is ndarray-like, and slicing a Series also slices its index

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print()
# Return the first value by position; positions start at 0, not 1.
# .iloc makes positional access explicit (plain s[0] on a label index is deprecated in recent pandas).
print(s.iloc[0])
print()
# [:3] returns positions 0 through 2; the stop position 3 is excluded. The index is sliced along with the values.
print(s.iloc[:3])
print()
# Series.median() returns the median of the Series.
print(s.median())
# Compare each value with the median: True where the value is greater than the median, False otherwise.
print(s > s.median())
# Keep only the values for which the comparison returned True.
print(s[s > s.median()])
print()
# Select by position: 4 is 'e', 3 is 'd', and 1 is 'b'. The label-based call below returns the same rows.
print(s.iloc[[4, 3, 1]])
print(s[['e', 'd', 'b']])
print()
# np.exp computes the exponential of every element.
print(np.exp(s))
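Because the example above uses random numbers, its output changes on every run. The same operations on fixed values make the results easier to check by hand; the values below are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 5.0, 3.0, 4.0, 2.0], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]            # value at position 0 (label 'a')
head = s.iloc[:3]            # positions 0, 1, 2 (labels 'a', 'b', 'c')
med = s.median()             # middle value of 1, 2, 3, 4, 5
above = s[s > med]           # boolean mask keeps values greater than the median
picked = s.iloc[[4, 3, 1]]   # positions 4, 3, 1 (labels 'e', 'd', 'b')

print(first, med)
print(above)
print(picked)
```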

dtype: like a NumPy array, a pandas Series has a single dtype

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s.dtype)
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s.dtype)
s = pd.Series({97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f'})
print(s.dtype)  # object dtype for string values (newer pandas versions may infer a dedicated string dtype)
dates = pd.date_range('2021-06-01', periods=5, freq='D')
s = pd.Series(dates)
print(s)
print(s.dtype)

Converting from pandas to a NumPy array: if you need the actual array backing a Series, use Series.array. A Series is ndarray-like, but if you need an actual ndarray, use Series.to_numpy().

s = pd.Series(pd.date_range('2021-06-01', periods=5, freq='D'))
print(s.array)
print()
# to_numpy is a method, so it must be called with parentheses
print(s.to_numpy())
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s.array)
print()
print(s.to_numpy())
s = pd.Series({97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f'})
print(s.array)
print()
print(s.to_numpy())
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s.array)
print()
print(s.to_numpy())
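The difference between the two accessors is mainly in the type of object returned. A small sketch, assuming only the public pd.api.extensions.ExtensionArray base class for the type check; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Series.array returns a pandas ExtensionArray wrapping the data.
backing = s.array
print(type(backing).__name__)

# Series.to_numpy() returns a plain NumPy ndarray; the index labels are not included.
as_ndarray = s.to_numpy()
print(type(as_ndarray).__name__)

# to_numpy also accepts a dtype when a conversion is wanted.
as_float = s.to_numpy(dtype=float)
print(as_float.dtype)
```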

Series is dict-like: you can get and set values by index label:

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s)
print()
# Return the value stored under the label 'a'
print(s['a'])
print()
# Assign 12 to the label 'e'; the label does not exist yet, so a new entry is added
s['e'] = 12
print(s)
print()
# `in` checks whether a label is present in the index: True for 'e'
print('e' in s)
print()
# Same check for 'f', which is not in the index: False
print('f' in s)
print()
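As with a dict, looking up a label that is not in the index raises KeyError, while Series.get() returns None or a supplied default instead. A minimal sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Bracket access to a missing label raises KeyError.
try:
    s["f"]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# get() returns None for a missing label, or the default passed as second argument.
missing_default = s.get("f")
missing_nan = s.get("f", np.nan)
present = s.get("a")

print(key_error_raised, missing_default, missing_nan, present)
```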

Vectorized operations and label alignment with Series

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s)
# Add the Series to itself element-wise
print(s + s)
print()
# Multiply every element by 2
print(s * 2)
print()
# np.exp computes the exponential of every element
print(np.exp(s))
print()
print(s.iloc[1:])
print(s.iloc[:-1])
# Operands are aligned by label; labels present in only one operand ('a' and 'd') give NaN
print(s.iloc[1:] + s.iloc[:-1])
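The last line of the example adds two Series whose labels only partly overlap. The result uses the union of the labels, and labels missing from either side become NaN. The sketch below shows that result, plus two common ways to handle it: dropna() and Series.add() with fill_value.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

left = s.iloc[1:]    # labels b, c, d
right = s.iloc[:-1]  # labels a, b, c

# Alignment by label: 'a' and 'd' exist on one side only, so they become NaN.
total = left + right
print(total)

# Drop the labels that could not be matched.
clean = total.dropna()
print(clean)

# Or treat a missing operand as 0 instead of producing NaN.
filled = left.add(right, fill_value=0)
print(filled)
```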

Name attribute

s = pd.Series(np.random.randn(5), name='RANDOM_SERIES')
print(s)
print(s.name)
# rename returns a new Series with the given name
s = s.rename("different")
print(s.name)

2. DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

DataFrame accepts many different kinds of input:

  1. Dict of 1D ndarrays, lists, dicts, or Series
  2. 2-D numpy.ndarray
  3. Structured or record ndarray
  4. A Series
  5. Another DataFrame

    From dict of Series

    d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
    pd.DataFrame(d, index=['d', 'b', 'a'])
    # 'three' is not a key of d, so that column contains only NaN
    pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
    df.index
    df.columns

    From dict of ndarrays / lists

    s = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
    pd.DataFrame(s)
    pd.DataFrame(s, index=['a', 'b', 'c', 'd'])

    From structured or record array

    # 'S10' is a 10-byte string field; the older 'a10' spelling is deprecated in NumPy 2
    s = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
    s[:] = [(1, 2., 'Hello'), (2, 3., "World")]
    pd.DataFrame(s)
    pd.DataFrame(s, columns=['C', 'A', 'B'])

    From a list of dict

    s = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    pd.DataFrame(s)
    pd.DataFrame(s, index=['first', 'second'])
    pd.DataFrame(s, columns=['a', 'b'])

    From a dict of tuples

    pd.DataFrame({
        ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
        ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
        ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
        ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
        ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10},
    })

    From a Series

    The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
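A short sketch of the Series case described above. The Series name "ser" and the alternative column name "renamed" are made up for illustration; Series.to_frame(name=...) is used here as one way to provide a different column name.

```python
import pandas as pd

ser = pd.Series(range(3), index=list("abc"), name="ser")

# The DataFrame keeps the Series index and gets one column named after the Series.
df = pd.DataFrame(ser)
print(df)

# Providing another column name overrides the Series name.
renamed = ser.to_frame(name="renamed")
print(renamed.columns.tolist())
```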

From a list of dataclasses

from dataclasses import make_dataclass
Point = make_dataclass("Point", [("x", int), ("y", int)])
pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])

Missing data

To build a DataFrame with missing data, use np.nan to represent the missing values.
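A minimal sketch of this, with made-up column names and values: np.nan marks the missing cells, and DataFrame.isna() reports where they are.

```python
import numpy as np
import pandas as pd

# np.nan marks the cells that have no value.
df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [np.nan, 2.0, 3.0],
})
print(df)

# isna() returns a boolean frame; summing it counts missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```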

Object creation