SorenKarst / longread_umi

GNU General Public License v3.0

Empty umi_bin_map.txt file #10

Closed a-89 closed 4 years ago

a-89 commented 4 years ago

Hi everyone,

Many thanks for developing this pipeline and the UMI strategy, this is really promising! I am trying to implement the pipeline to analyze some UMI data we have been sequencing with Nanopore.

I am getting an empty umi_bin_map.txt file, although I do have a umi1_map.sam and a umi2_map.sam. However, I think none of the read IDs maps to both umi1 and umi2, and most of the reads are not in either of the SAM files.
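A quick way to check that overlap is sketched below. This is a minimal Python sketch and not part of longread_umi. It assumes the read IDs sit in the RNAME column (column 3) of umi1_map.sam and umi2_map.sam; if your mapping stores them elsewhere, change `column`. The function names and toy records are hypothetical, and alternative hits stored in optional tags are ignored.

```python
def names_in_sam(sam_text, column=2):
    """Collect the values of one SAM column over all mapped records.

    column is 0-based: 0 is QNAME, 2 is RNAME. Which one holds the read ID
    depends on how the mapping was run (assumption: RNAME).
    """
    names = set()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 marks an unmapped record
        names.add(fields[column])
    return names


def reads_with_both_umis(sam1_text, sam2_text, column=2):
    """Read IDs found among the mapped records of both SAM files."""
    return names_in_sam(sam1_text, column) & names_in_sam(sam2_text, column)


# Toy records (hypothetical), standing in for the two real SAM files.
toy_umi1 = "\n".join([
    "@SQ\tSN:readA\tLN:90",
    "umi1\t0\treadA\t5\t37\t18M\t*\t0\t0\tCATTGTCGTATAACGGAC\t*",
    "umi2\t4\t*\t0\t0\t*\t*\t0\t0\tGTTTGCTCTAGGGCGAAA\t*",
])
toy_umi2 = "\n".join([
    "umi1\t16\treadA\t60\t37\t18M\t*\t0\t0\tACGCACAGCAAGTCGCTC\t*",
    "umi3\t0\treadB\t3\t37\t18M\t*\t0\t0\tCCCCAGAGTAATATAATC\t*",
])
print(reads_with_both_umis(toy_umi1, toy_umi2))
```

With real files, the text would come from `open("umi1_map.sam").read()` and likewise for umi2. An empty result would match the empty umi_bin_map.txt described above.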

I think the problem may come from earlier in the pipeline. I have used vsearch instead of usearch, which should perform similarly. My UMIs have a maximum cluster size of 3 (only one cluster; all of the others have size 2). Is that normal? Could that result in an empty umi_bin_map.txt? A few lines of umi_ref.txt:

umi1 CATTGTCGTATAACGGAC ACGCACAGCAAGTCGCTC yes umi1/umi1/umi1/umi1 3
umi2 GTTTGCTCTAGGGCGAAA CCATGTTGCACCACACGC yes umi2/umi2/umi2/umi2 2
umi3 CGGTGATATACGGCGGTG CCCCAGAGTAATATAATC yes umi3/umi3/umi3/umi3 2
umi4 CTTTACTTTGGTGCAGTG CCCCGACACAACCTATAA yes umi4/umi4/umi4/umi4 2
umi5 ATATAGCTCGCTGCGGGC TGCCGCTCCAGAATGATC yes umi5/umi5/umi5/umi5 2
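To see the cluster size distribution at a glance, something like the following could be used. It is a minimal Python sketch, assuming each umi_ref.txt record is one line of six whitespace-separated fields with the cluster size in the last field, as the excerpt suggests. The function name `cluster_size_counts` is hypothetical.

```python
from collections import Counter


def cluster_size_counts(umi_ref_text):
    """Count how many UMI records have each cluster size.

    Assumption: six whitespace-separated fields per record,
    with the cluster size as the last field.
    """
    counts = Counter()
    for line in umi_ref_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        counts[int(fields[5])] += 1
    return counts


# The five records quoted in this post.
example = """\
umi1 CATTGTCGTATAACGGAC ACGCACAGCAAGTCGCTC yes umi1/umi1/umi1/umi1 3
umi2 GTTTGCTCTAGGGCGAAA CCATGTTGCACCACACGC yes umi2/umi2/umi2/umi2 2
umi3 CGGTGATATACGGCGGTG CCCCAGAGTAATATAATC yes umi3/umi3/umi3/umi3 2
umi4 CTTTACTTTGGTGCAGTG CCCCGACACAACCTATAA yes umi4/umi4/umi4/umi4 2
umi5 ATATAGCTCGCTGCGGGC TGCCGCTCCAGAATGATC yes umi5/umi5/umi5/umi5 2
"""

# Prints (size, number of UMIs) pairs: [(2, 4), (3, 1)]
print(sorted(cluster_size_counts(example).items()))
```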

I am a bit lost at this point; I have changed many things by now and I think I have messed something up. Any help would be really appreciated. Thanks in advance!

Best, Anna

SorenKarst commented 4 years ago

Hi Anna,

Thank you for trying out the pipeline, and great that you are testing vsearch as an alternative to usearch. I think many will be glad to have that option.

The very low cluster size is definitely a bad sign. There are many possible reasons, related to either the data or the pipeline, so I will ask you a few questions:

1) How long is the amplicon?
2) How much DNA template did you use for your UMI PCR, and how many copies of your target do you expect to be in that amount of DNA?
3) How much Nanopore data did you generate, and with what type of flow cell?
4) Did you test the pipeline with vsearch on the test dataset? Did you get the same results as with usearch?

Thanks, Soren

a-89 commented 4 years ago

Hi Soren,

Thanks for your rapid answer!

I am afraid I left out some essential details in my first message. We are generating amplicons from the 16S-23S region (same primers as in your article), so the amplicons are around 4,500 bp long. Our initial test samples were the ZymoBIOMICS mock community and a fecal sample.

Compared to your approach, we replaced the ncec primer with the ONT tag, though. Moreover, we used the PCR barcoding kit to perform a second PCR and barcode the samples. That was probably a mistake, since we got plenty of short sequences (primer dimers, maybe). Is there a reason why you chose the ncec_pcr primer?

The umi_bin_map.txt file is not empty anymore; I increased the length of the start and end regions (file: reads_tf_umi1/2.fa) so that they include the UMI sequence. Since my samples are barcoded, using 70-80 bp from each end was not enough to capture the UMI sequence. However, the consensus I am getting is not reliable with such a low cluster size.
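The effect of the window length can be illustrated with a small Python sketch. This is not the pipeline's code: `read_ends` is a hypothetical helper, the 60 bp prefix stands in for the extra barcode and adapter sequence (its real length is an assumption), and the exact substring test is only for illustration.

```python
def read_ends(seq, n):
    """Return the first and last n bases of a read."""
    if n <= 0:
        raise ValueError("n must be positive")
    return seq[:n], seq[-n:]


# Toy read: hypothetical 60 bp barcode/adapter prefix, an 18 bp UMI, then insert.
umi = "CATTGTCGTATAACGGAC"
read = "A" * 60 + umi + "G" * 200

for window in (70, 100):
    start, _end = read_ends(read, window)
    # With a 70 bp window the UMI is cut off; with 100 bp it is fully included.
    print(window, umi in start)
```

The same reasoning applies to the other end of the read: any extra sequence added outside the UMI pushes it further from the read end, so the extracted window has to grow by at least that much.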

I am afraid I cannot test usearch since we do not have this software available.

Thanks again. Best,

Anna

SorenKarst commented 4 years ago

Hi Anna,

Okay. With this level of customization, I think it will be easier to troubleshoot if I take a look at the data, if that is okay with you.

Can you upload the following to Dropbox or similar and PM me the link?

Thanks!

Regards Søren

a-89 commented 4 years ago

Hi Søren, thanks for your answer and your time. How can I send you the link with all the information you have asked for? I cannot find a private message option in GitHub. Thanks again!

SorenKarst commented 4 years ago

Hi Anna,

I took a quick look at the data, and I agree with your own assessment. The majority of the data is short reads, probably adaptor dimers, and therefore it is not a good starting point for troubleshooting. Making new libraries is probably a good idea.
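For anyone wanting to check this on their own data, the share of short reads can be estimated with a few lines of Python. This is a minimal sketch that assumes plain four-line FASTQ records (no wrapped sequence lines). The function names are hypothetical, and the length cutoff is something you would choose based on your expected amplicon length.

```python
def read_lengths(fastq_text):
    """Sequence lengths from 4-line FASTQ records (header, seq, '+', quals)."""
    lines = fastq_text.splitlines()
    return [len(lines[i + 1]) for i in range(0, len(lines) - 3, 4)]


def fraction_shorter_than(lengths, min_len):
    """Fraction of reads whose length is below min_len (0.0 if no reads)."""
    if not lengths:
        return 0.0
    return sum(length < min_len for length in lengths) / len(lengths)


# Toy FASTQ (hypothetical): two short reads and one longer read.
toy_fastq = "\n".join([
    "@r1", "ACGTACGTACGT", "+", "I" * 12,
    "@r2", "ACGTACGT", "+", "I" * 8,
    "@r3", "ACGT" * 10, "+", "I" * 40,
])

lengths = read_lengths(toy_fastq)
print(lengths, fraction_shorter_than(lengths, 20))
```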

Regarding the pipeline test: as you mentioned yourself, the pipeline seems to run fine up until medaka polishing, which means your vsearch implementation works. This is great and definitely something we will work into the pipeline in the future. The medaka polishing probably fails due to the issue described in #8. We will make a new release at the start of January that will fix this.

Regarding replacing the amplification primers with Nanopore demultiplexing primers: I like the idea, and in theory it should work just as well as our setup. For your first test library I would recommend doing it without barcoding, to minimize complexity.

Regarding porechop, the modification is not critical, even if you use different primers. The primary purpose of using porechop is to utilize data present as concatenated amplicons, which we observed a lot when doing the Loman lab onepot ligation protocol. So skip this to begin with.

I will close this issue now, but feel free to reopen it or open new issues if you encounter problems with your new libraries.

Merry Christmas :)

Regards Søren

a-89 commented 4 years ago

Again, many thanks for your time Søren. I am happy to see the vsearch results seem to be comparable to usearch ones! We will be repeating the libraries at the beginning of January, and hopefully, we will get better data to work with. I will try your options for solving the problem with medaka. Merry Christmas and happy new year :)

tdfy commented 4 years ago

Hey Anna, could you elaborate on your vsearch implementation? I am running on WSL and the usearch 32-bit requirement is holding up installation. Thank you! -Todd

thierryjanssens commented 2 years ago

Hello Anna,

Could you explain how to use vsearch in the longread_umi code? Thank you!

kind regards,

Thierry