Is taking the complement of the negative strands a right thing to do?

zhanyuanucb commented 5 years ago

The following code taking the complement of a negative strand of an DNA sequence in 1a_producing_output_files_with_motif.ipynb looks suspicious:

for n in range(0, len(data)):
            # Extract the header information
            header = data[n].description.split('|')
            descr = data[n].description
            regionID = header[0]
            expressed = header[1]
            speciesID = header[2]
            strand = header[3]
            # Complement all sequences in the negative DNA strand
            if strand == '-':
                # Using the syntax [e for e in base_pairs[n]] to create a new pointer for each position
                one_hot.append([descr, expressed, speciesID, [[e for e in base_pairs[n]] for n in data[n].seq.complement()]])
            else:
                one_hot.append([descr, expressed, speciesID, [[e for e in base_pairs[n]] for n in data[n].seq]])

I checked the header information in one region sequence file:

# Inspect the fasta files in the input_folder_path

# data contains all 24 species's DNA sequence of one specific region
data = list(SeqIO.parse(input_folder_path + file,"fasta"))

one_species = data[0]  # take the first species of a given region sequence
print(one_species)

descr = one_species.description
header = one_species.description.split('|')
print('*******************************************')
print('regionID: ', header[0])
print('expressed: ', header[1])
print('speciesID: ', header[2])
print('strand: ', header[3])

# Print out the information as the following:
ID: VT1592|1|MEMB002A|-|2831
Name: VT1592|1|MEMB002A|-|2831
Description: VT1592|1|MEMB002A|-|2831
Number of features: 0
Seq('TCCGGCTGGCGGGGAATATTTTTTAAGCGATCGCACGTGCGACCCCCAAAACCT...AGC', SingleLetterAlphabet())
*******************************************
regionID:  VT1592
expressed:  1
speciesID:  MEMB002A
strand:  -

iamciera commented 5 years ago

No matter what we should not be taking the complement here. That negative is meaningless at this point. We need to update immediately. @zhanyuanucb, can you please modify this code? This is basically mapping the score to the complement, which will make a difference to the nucleotide TFBS match up. From my understanding of what the bidirectional architecture is doing, it shouldn't affect the model insanely, since the sequence is "correct" in a sense and the TFBS is also correct, but how they relate to each other will be greatly affected in half of the sequences. This could add to the "noise" we are seeing in the nucleotide information.

@adfost or @zhanyuanucb - Is there a way to see what the output sequence is, after the above loop is run? We really need to check this.

Also, the first thing we have to do is re-run everything without this line and compare with 5_map_motif_no_threshold_14Nov2018.zip. @adfost, do you have time to reformat? How long does this take if you know what you are doing?

zhanyuanucb commented 5 years ago

Should I just totally ignore negative strand?

adfost commented 5 years ago

I'm going to look into it right now.

adfost commented 5 years ago

I will try ignoring the negative strand.

iamciera commented 5 years ago

No. Don't subset out the negative strand. It doesn't mean negative strand. Just ignore that part of the header. It doesn't mean anything. Just remove that part of the header. All the strands are in the same direction in the fasta file.

DiscoveryDNA / team_neural_network

Is taking the complement of the negative strands a right thing to do? #32