iLearnPlus is the first machine-learning platform with both graphical and web-based user interfaces that enables the construction of automated machine-learning pipelines for computational analysis and predictions using nucleic acid and protein sequences.
A question about the PSTNPss function in util/FileProcessing.py #16
Hello, I am interested in your project, and I ran into some questions while reading the source code. I hope you can help me answer them.
My questions are about the PSTNPss function in util/FileProcessing.py.
Question 1: In this function, you subtract one from the total number of samples for the corresponding label (p_num or n_num) and subtract one from the trinucleotide count at the corresponding position (po_number or ne_number). I don’t understand the purpose of this step or the principle behind it.
p_num, n_num = positive_number, negative_number
po_number = matrix_po[j][order[sequence[j: j + 3]]]
if i[0] in positive_key and po_number > 0:
    po_number -= 1
    p_num -= 1
ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
if i[0] in negative_key and ne_number > 0:
    ne_number -= 1
    n_num -= 1
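To make the arithmetic in the snippet above concrete: subtracting one from both the count and the sample total gives the frequency computed over all the other samples, that is, with the current sample left out. The following is a minimal sketch of that arithmetic only, not the iLearnPlus code; the toy data and the `frequency` helper are hypothetical names introduced for illustration.

```python
# Hypothetical toy data: the trinucleotide seen at one fixed position j
# in each positive training sequence. Not taken from iLearnPlus.
positive_at_j = ["ACG", "ACG", "TTT", "ACG"]


def frequency(trinucleotide, observed, exclude_self=False):
    """Fraction of samples that carry `trinucleotide` at this position.

    With exclude_self=True, one occurrence and one sample are removed,
    mirroring `po_number -= 1` and `p_num -= 1` in the snippet above.
    """
    count = observed.count(trinucleotide)
    total = len(observed)
    if exclude_self and count > 0:
        count -= 1
        total -= 1
    return count / total


# The sample carrying "TTT" is the only one with that trinucleotide here.
full = frequency("TTT", positive_at_j)                     # 1 / 4
loo = frequency("TTT", positive_at_j, exclude_self=True)   # 0 / 3
print(full, loo)
```

With the sample counted, "TTT" has frequency 1/4 in the positive set; with the sample left out, it has frequency 0/3. Whether this leave-one-out reading is the intended purpose is exactly what Question 1 asks.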
Question 2: The function also treats training and testing samples differently. The subtraction above is applied to samples from the training dataset, but not to samples from the testing dataset, and I don’t understand why. I have included the full function below for your convenience. Thank you for your time and help!
def PSTNPss(self):
    try:
        if not self.is_equal:
            self.error_msg = 'PSTNPss descriptor need fasta sequence with equal length.'
            return False

        fastas = []
        for item in self.fasta_list:
            if item[3] == 'training':
                fastas.append(item)
                fastas.append([item[0], item[1], item[2], 'testing'])
            else:
                fastas.append(item)

        for i in fastas:
            if re.search('[^ACGT-]', i[1]):
                self.error_msg = 'Illegal character included in the fasta sequences, only the "ACGT[U]" are allowed by this encoding scheme.'
                return False

        encodings = []
        header = ['SampleName', 'label']
        for pos in range(len(fastas[0][1]) - 2):
            header.append('Pos.%d' % (pos + 1))
        encodings.append(header)

        positive = []
        negative = []
        positive_key = []
        negative_key = []
        for i in fastas:
            if i[3] == 'training':
                if i[2] == '1':
                    positive.append(i[1])
                    positive_key.append(i[0])
                else:
                    negative.append(i[1])
                    negative_key.append(i[0])

        nucleotides = ['A', 'C', 'G', 'T']
        trinucleotides = [n1 + n2 + n3 for n1 in nucleotides for n2 in nucleotides for n3 in nucleotides]
        order = {}
        for i in range(len(trinucleotides)):
            order[trinucleotides[i]] = i

        matrix_po = self.CalculateMatrix(positive, order)
        matrix_ne = self.CalculateMatrix(negative, order)

        positive_number = len(positive)
        negative_number = len(negative)

        for i in fastas:
            if i[3] == 'testing':
                name, sequence, label = i[0], i[1], i[2]
                code = [name, label]
                for j in range(len(sequence) - 2):
                    if re.search('-', sequence[j: j + 3]):
                        code.append(0)
                    else:
                        p_num, n_num = positive_number, negative_number
                        po_number = matrix_po[j][order[sequence[j: j + 3]]]
                        if i[0] in positive_key and po_number > 0:
                            po_number -= 1
                            p_num -= 1
                        ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
                        if i[0] in negative_key and ne_number > 0:
                            ne_number -= 1
                            n_num -= 1
                        code.append(po_number / p_num - ne_number / n_num)
                        # print(sequence[j: j+3], order[sequence[j: j+3]], po_number, p_num, ne_number, n_num)
                encodings.append(code)

        self.encoding_array = np.array([])
        self.encoding_array = np.array(encodings, dtype=str)
        self.column = self.encoding_array.shape[1]
        self.row = self.encoding_array.shape[0] - 1
        del encodings
        if self.encoding_array.shape[0] > 1:
            return True
        else:
            return False
    except Exception as e:
        self.error_msg = str(e)
        return False
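The difference raised in Question 2 can be reproduced outside the class. In the function above, each training entry is re-added to `fastas` as a 'testing' copy, so it is encoded in the same loop as the real testing entries. The subtraction then depends only on whether the sample name appears in `positive_key` or `negative_key`. Below is a minimal standalone sketch of that behavior, not the iLearnPlus implementation. `count_matrix` is an assumed stand-in for `CalculateMatrix` (which is not shown here) and is taken to return per-position trinucleotide counts; `encode` and the toy sequences are likewise hypothetical.

```python
from itertools import product

# Map each of the 64 trinucleotides to a column index, as `order` does above.
ORDER = {"".join(t): k for k, t in enumerate(product("ACGT", repeat=3))}


def count_matrix(sequences):
    """Assumed stand-in for CalculateMatrix: per-position trinucleotide counts."""
    length = len(sequences[0]) - 2
    matrix = [[0] * len(ORDER) for _ in range(length)]
    for seq in sequences:
        for j in range(length):
            tri = seq[j:j + 3]
            if tri in ORDER:  # windows containing '-' are not counted
                matrix[j][ORDER[tri]] += 1
    return matrix


def encode(name, sequence, positives, negatives):
    """Encode one sequence; positives/negatives map training names to sequences."""
    m_po = count_matrix(list(positives.values()))
    m_ne = count_matrix(list(negatives.values()))
    features = []
    for j in range(len(sequence) - 2):
        tri = sequence[j:j + 3]
        if "-" in tri:
            features.append(0)
            continue
        po, p_num = m_po[j][ORDER[tri]], len(positives)
        ne, n_num = m_ne[j][ORDER[tri]], len(negatives)
        # Same condition as in PSTNPss: adjust only when the name is a training name.
        if name in positives and po > 0:
            po, p_num = po - 1, p_num - 1
        if name in negatives and ne > 0:
            ne, n_num = ne - 1, n_num - 1
        features.append(po / p_num - ne / n_num)
    return features


positives = {"p1": "ACGT", "p2": "ACGA"}
negatives = {"n1": "TTTT", "n2": "TTTA"}

as_training = encode("p1", "ACGT", positives, negatives)  # name is a training name
as_testing = encode("t1", "ACGT", positives, negatives)   # same sequence, unseen name
print(as_training, as_testing)
```

In this toy case the same sequence is encoded as [1.0, 0.0] under a training name and as [1.0, 0.5] under an unseen name. The second position differs because "CGT" occurs at that position only in p1 itself. One plausible reading, not confirmed by the maintainers, is that the subtraction keeps a training sample's own label from contributing to its own features, while a testing sample never contributed to the counts and so needs no adjustment.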